POSTERIOR CONVERGENCE RATES OF DIRICHLET 
MIXTURES AT SMOOTH DENSITIES 

By Subhashis Ghosal and Aad van der Vaart 

North Carolina State University and Vrije Universiteit, Amsterdam 

We study the rates of convergence of the posterior distribution 
for Bayesian density estimation with Dirichlet mixtures of normal 
distributions as the prior. The true density is assumed to be twice 
continuously differentiable. The bandwidth is given a sequence of 
priors which is obtained by scaling a single prior by an appropriate 
order. In order to handle this problem, we derive a new general rate 
theorem by considering a countable covering of the parameter space 
whose prior probabilities satisfy a summability condition together 
with certain individual bounds on the Hellinger metric entropy. We 
apply this new general theorem on posterior convergence rates by 
computing bounds for Hellinger (bracketing) entropy numbers for the 
involved class of densities, the error in the approximation of a smooth 
density by normal mixtures and the concentration rate of the prior. 
The best obtainable rate of convergence of the posterior turns out to 
be equivalent to the well-known frequentist rate for integrated mean 
squared error $n^{-2/5}$ up to a logarithmic factor. 

1. Introduction. Kernel methods for density estimation have been in use 
for nearly fifty years. Bayesian kernel density estimation using a Dirichlet 
process on the mixing distribution has been considered more recently (cf. 
[5, 12]), where the density is viewed as a mixture of normals with an arbitrary 
mixing distribution and a Dirichlet process (cf. [4]) is used as a prior on 
the mixing distribution. Efficient Gibbs sampling algorithms for the compu- 
tation of the posterior based on a Dirichlet mixture process have been devel- 
oped; see, for instance, [3]. Under certain conditions, posterior consistency 
of such a Dirichlet mixture prior with a normal kernel has been obtained 
by Ghosal, Ghosh and Ramamoorthi [7]. Ghosal and van der Vaart [9] obtained 
rates of convergence of the posterior for the Dirichlet mixture in the 
case that the true density is a location or location-scale mixture of normals 
with standard deviations bounded away from zero and infinity. Under natu- 
ral conditions on the prior, they showed that the posterior converges at rate 
$(\log n)^{\kappa}/\sqrt{n}$, where $\kappa$ depends on the tail behavior of the base measure of the 
Dirichlet process. The rate of convergence was obtained by finding a sharp 
entropy estimate and prior concentration rate for this problem and then ap- 
plying the general posterior convergence rate theorem of Ghosal, Ghosh and 
van der Vaart [8]. The fast rate of convergence $(\log n)^{\kappa}/\sqrt{n}$ arises because a 
mixture of normals with standard deviations bounded by two positive num- 
bers is "super-smooth." Super-smooth densities can be approximated by 
kernel estimators with a bandwidth that approaches zero at a logarithmic 
rate and super-smooth mixtures can be well approximated by finite normal 
mixtures with a small number of components (cf. Lemma 3.1 of [9]). This 
leads to small entropy numbers and high prior concentration (comparable 
to those of finite-dimensional models) with a nearly parametric convergence 
rate as a consequence. From the same entropy bounds for normal mixtures, 
Ghosal and van der Vaart [9] also obtained essentially the same convergence 
rate $(\log n)^{\kappa}/\sqrt{n}$ for sieved maximum likelihood estimators (MLE). 
Under the same super-smoothness condition, Genovese and Wasserman [6] 
earlier obtained the much weaker convergence rate $n^{-1/6}(\log n)^{(1+\delta)/6}$ for 
some $\delta > 0$ for sieved MLEs based on Gaussian mixtures. 

While it is interesting to observe nearly parametric rates of convergence, 
the super-smoothness of the true density with a bounded known range for 
the standard deviation is a restrictive assumption. Scricciolo [13] considered 
the situation where the true density is still super-smooth, but the prior for 
the bandwidth parameter contains zero in its support. The resulting rate 
of convergence is much slower in this case and depends on the decay rate 
of the prior for the bandwidth at zero. In this paper, we consider the more 
realistic situation where the density of the observations is smooth, but may 
not be a mixture of normal densities. A smooth density can be approximated 
by mixtures of normals, but it is necessary to let the bandwidth (standard 
deviations of the components) tend to zero and allow an increasing number of 
components. This increases the complexity of the model and leads to larger 
entropy and smaller prior concentration, with, as a consequence, a slower 
rate of convergence of the posterior distribution. 

More specifically, we assume that the density of the observations is twice 
continuously differentiable. Under some regularity conditions, the optimal 
rate of convergence of a kernel estimator is then $n^{-2/5}$. The main purpose of 
this paper is to establish that the posterior distribution based on a Dirichlet 
mixture of normal prior attains the same rate, up to a logarithmic factor. 
In addition, we obtain the same rate for the sieved maximum likelihood 
estimator using a sieve consisting of normal mixtures. It may be noted that, 
even though the estimation of a smooth density is a considerably harder 
problem than that of a super-smooth density, our obtained rate, which is 
nearly optimal for the given problem, is still much better than the $n^{-1/6}$ rate 
Genovese and Wasserman [6] obtained for sieved MLEs in the super-smooth 
case. 

1.1. Notation. Throughout the paper $X_1, X_2, \ldots$ are independent and 
identically distributed (i.i.d.) as $p_0$ on $\mathbb{R}$. The corresponding probability 
measure is denoted by $P_0$. 

The supremum and $L_1$-norm are denoted by $\|\cdot\|_\infty$ and $\|\cdot\|_1$, respectively. 
For two density functions $f, g:\mathbb{R}\to[0,\infty)$, we let $h$ denote the Hellinger 
distance defined by $h^2(f,g)=\int (f^{1/2}-g^{1/2})^2\,d\lambda$, where $\lambda$ is the Lebesgue 
measure on $\mathbb{R}$. The $\varepsilon$-covering number $N(\varepsilon,S,d)$ of a semi-metric space $S$ 
relative to the semi-metric $d$ is the minimal number of balls of radius $\varepsilon$ 
needed to cover $S$. Similarly, the $\varepsilon$-bracketing number $N_{[\,]}(\varepsilon,S,d)$ is the 
minimal number of $\varepsilon$-brackets $[f,g]=\{u: f\le u\le g\}$ needed to cover $S$, 
the size of a bracket $[f,g]$ being the distance $d(f,g)$ between upper and 
lower brackets (cf., e.g., [14]). The logarithms of the covering and bracketing 
numbers are referred to as entropies without and with bracketing. 

We write "$\lesssim$" for inequality up to a constant multiple, where the constant 
is universal or (at least) unimportant for our purposes. An expression $x^{a+}$ 
in a statement means that the statement holds for $x^{a'}$ for any $a'>a$. Let 
$\phi(x)=(2\pi)^{-1/2}\exp(-x^2/2)$, the standard normal density, and let $\phi_\sigma(x)=
\sigma^{-1}\phi(x/\sigma)$. An asterisk denotes convolution and $p_{F,\sigma}=F*\phi_\sigma$ is a Gaussian 
mixture with mixing distribution $F$. The distribution which is degenerate 
at $\theta$ is denoted by $\delta_\theta$. The support of a density $p$ is denoted by $\mathrm{supp}(p)$. 
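
As a small illustration of this notation, the following sketch evaluates a Gaussian mixture $p_{F,\sigma}$ for a discrete mixing distribution $F$ and approximates the Hellinger distance $h$ numerically. The function names, the integration window $[-12, 12]$ and the midpoint rule are assumptions made for this example only; they are not part of the paper.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def phi_sigma(x, sigma):
    """Rescaled kernel phi_sigma(x) = phi(x / sigma) / sigma."""
    return phi(x / sigma) / sigma

def mixture_density(x, atoms, weights, sigma):
    """p_{F,sigma}(x) for a discrete F = sum_j weights[j] * delta_{atoms[j]}."""
    return sum(w * phi_sigma(x - z, sigma) for z, w in zip(atoms, weights))

def hellinger(f, g, lo=-12.0, hi=12.0, steps=4000):
    """h(f, g) via the midpoint rule on [lo, hi] (assumed to carry almost all mass)."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += (math.sqrt(f(x)) - math.sqrt(g(x))) ** 2 * dx
    return math.sqrt(total)
```

For two unit-variance normal densities with means 0 and 1 (that is, $F=\delta_0$ and $F=\delta_1$ with $\sigma=1$), the result can be compared with the closed form $h^2 = 2(1-e^{-1/8})$.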

1.2. Assumptions. Throughout the paper, we assume that $h(p_0,p_0*\phi_\sigma)=
O(\sigma^2)$ as $\sigma\to0$. If $p_0$ is twice continuously differentiable with $\int (p_0''/p_0)^2
p_0\,d\lambda<\infty$ and $\int (p_0'/p_0)^4 p_0\,d\lambda<\infty$, then the condition holds (cf. Lemma 4). 
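
The assumption can be checked by hand in one simple case. The sketch below takes $p_0$ to be the standard normal density (an illustrative choice, not one made in the paper), for which $p_0*\phi_\sigma$ is the $N(0,1+\sigma^2)$ density and the Hellinger affinity between two centered normals is available in closed form. The function name is hypothetical.

```python
import math

def hellinger_normal_vs_smoothed(sigma):
    """h(p0, p0 * phi_sigma) for p0 = N(0, 1), where p0 * phi_sigma = N(0, 1 + sigma^2)."""
    s1, s2 = 1.0, math.sqrt(1.0 + sigma * sigma)
    # Affinity (integral of sqrt(p0 * q)) of two centered normals with sds s1, s2.
    affinity = math.sqrt(2.0 * s1 * s2 / (s1 * s1 + s2 * s2))
    return math.sqrt(max(0.0, 2.0 * (1.0 - affinity)))

for sigma in (0.4, 0.2, 0.1, 0.05):
    h = hellinger_normal_vs_smoothed(sigma)
    print(f"sigma={sigma:5.2f}  h={h:.3e}  h/sigma^2={h / sigma**2:.4f}")
```

In this example a second-order expansion of the affinity gives $h \approx \sigma^2/(2\sqrt{2})$ for small $\sigma$, so the printed ratio should stabilize, in line with the $O(\sigma^2)$ condition.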

1.3. Organization. The main results of the paper are on the convergence 
rate of the posterior distribution and these are presented in Section 2. The 
proofs of the main theorems are contained in Sections 9 and 10, and are 
based on estimates of the entropies of normal mixtures obtained in Section 5, 
approximation lemmas given in Section 6 and lower bounds on Dirichlet 
probabilities obtained in Section 7. A general result on posterior convergence 
rates is obtained in Section 4, which is subsequently used in the proof of the 
main result in Section 2. The entropy estimates also have applications to 
rates of convergence of sieved MLEs and posterior distributions relative to 
sieved priors, as noted in Section 3. The proofs of the theorems in Section 3 
are given in Section 11. Sections 4-8 may be of some independent interest. 
2. Main results. We consider the sequence of priors $\Pi_n$ for $p$ defined 
structurally as follows: 

• $p_{F,\sigma}(x)=\int\phi_\sigma(x-z)\,dF(z)$; 

• $F\sim D_\alpha$, the Dirichlet process with base measure $\alpha=\alpha(\mathbb{R})\bar\alpha$, where $0<
\alpha(\mathbb{R})<\infty$ and $\bar\alpha$ is a probability measure; 

• $\sigma/\sigma_n\sim G$, where $\sigma_n$ is a sequence of positive real numbers converging to 
zero with $n^{-a_1}\lesssim\sigma_n\lesssim n^{-a_2}$ for some $0<a_2<a_1<1$ and $G$ is a fixed 
probability distribution on $(0,\infty)$ satisfying $G(s)\lesssim e^{-\beta s^{-\gamma}}$ as $s\to0$ and 
$1-G(s)\lesssim e^{-\beta s^{\gamma}}$ as $s\to\infty$ for some $\gamma>1$ and $\beta>0$. 

• $F$ and $\sigma$ are independent. 
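
To make the structure of the prior concrete, the following sketch draws one pair $(F,\sigma)$. It relies on choices that are not in the paper and are assumptions of this example: the Dirichlet process is approximated by a truncated stick-breaking construction (a standard representation that the text does not discuss), the normalized base measure is taken to be $N(0,1)$, $G$ is taken to be uniform on $[1,2]$ (hence compactly supported), and $\sigma_n=n^{-1/5}$. The name `sample_prior` and its parameters are hypothetical.

```python
import random

def sample_prior(n, truncation=50, mass=1.0, seed=0):
    """Draw (atoms, weights, sigma): a truncated stick-breaking approximation of F and a bandwidth."""
    rng = random.Random(seed)
    sigma_n = n ** (-0.2)                    # assumed scaling sequence sigma_n = n^(-1/5)
    sigma = sigma_n * rng.uniform(1.0, 2.0)  # sigma / sigma_n ~ G, here G = Uniform[1, 2]
    atoms, weights, remaining = [], [], 1.0
    for k in range(truncation):
        atoms.append(rng.gauss(0.0, 1.0))    # atom from the assumed normalized base measure
        # The last stick takes all remaining mass so that the weights sum to one.
        v = 1.0 if k == truncation - 1 else rng.betavariate(1.0, mass)
        weights.append(remaining * v)
        remaining *= 1.0 - v
    return atoms, weights, sigma             # F and sigma are drawn independently
```

The returned atoms and weights describe a discrete $F$, so the corresponding density is the finite mixture $\sum_j w_j\phi_\sigma(x-z_j)$.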

If $\Pi_n(p: d(p,p_0)>M\varepsilon_n\mid X_1,\ldots,X_n)\to0$ in $P_0^n$-probability for some $M>0$, 
we say that $\varepsilon_n\to0$ is (an upper bound for) the posterior convergence rate 
relative to $d$. 

The proof of the following posterior convergence theorem is given in Sec- 
tion 9. 

Theorem 1. Suppose that $p_0$ has compact support and that $a_2>(4+
\gamma)^{-1}$. If the base measure $\alpha$ has a continuous and positive density on an 
interval containing $\mathrm{supp}(p_0)$, then the posterior rate of convergence relative 
to $h$ is 

(2.1) $\varepsilon_n=\max\{(n\sigma_n)^{-1/2}(\log n),\ n^{-1/2}(\sigma_n^{-1})^{(\gamma/2(\gamma-1))+},\ \sigma_n^2\log n\}$. 

If $\mathrm{supp}(p_0)$ is a finite union of intervals and every interval $I$ in the support 
satisfies $P_0(I)\gtrsim\lambda(I)^a$ for some $a>0$, then this can be improved to the rate 

(2.2) $\varepsilon_n=\max\{(n\sigma_n)^{-1/2}(\log n),\ n^{-1/2}(\sigma_n^{-1})^{(\gamma/2(\gamma-1))+},\ \sigma_n^2\}$. 

Further, when $G$ is compactly supported, the middle terms on the right-hand 
side of (2.1) and (2.2) can be omitted. 

The best rate $n^{-2/5}(\log n)^{4/5}$ in the preceding theorem is obtained in the 
second assertion with $\gamma=\infty$ (i.e., $G$ is compactly supported) if $\sigma_n$ is chosen 
to be $n^{-1/5}(\log n)^{2/5}$, nearly equal to the optimal frequentist bandwidth 
choice $n^{-1/5}$. 
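
The balance behind this choice can be checked numerically. The sketch below evaluates the two terms of (2.2) that remain when $G$ is compactly supported, namely $(n\sigma_n)^{-1/2}\log n$ and $\sigma_n^2$, for the bandwidth above and for the classical choice $n^{-1/5}$. The function name and the sample size $n=10^6$ are assumptions of this example.

```python
import math

def rate_terms(n, sigma_n):
    """The two terms of (2.2) that remain when G is compactly supported."""
    return (n * sigma_n) ** -0.5 * math.log(n), sigma_n ** 2

n = 10 ** 6
balanced = n ** -0.2 * math.log(n) ** 0.4   # sigma_n = n^(-1/5) (log n)^(2/5)
classical = n ** -0.2                       # sigma_n = n^(-1/5)

# With the balanced bandwidth both terms coincide with n^(-2/5) (log n)^(4/5).
print(rate_terms(n, balanced), n ** -0.4 * math.log(n) ** 0.8)
# With the classical bandwidth the first term dominates and equals n^(-2/5) log n.
print(rate_terms(n, classical), n ** -0.4 * math.log(n))
```

The two terms are equal exactly when $\sigma_n^{5/2}=n^{-1/2}\log n$, which is the bandwidth stated in the text.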

A common practice is to consider an inverse gamma prior on $\sigma^2$, which 
leads to conditional conjugacy and hence to an efficient Gibbs sampling pro- 
cedure. Unfortunately, our theorem does not apply to this prior, because the 
inverse gamma prior has a polynomially decaying tail near infinity. Indeed, 
even with faster-than-exponential decay, the theorem indicates that rates 
may suffer whenever the support of the prior is noncompact. Because these 
rates are only upper bounds, a negative conclusion cannot be reached based 
on these. However, it may be mentioned that even the issue of consistency is 
open for the inverse gamma prior unless an upper truncation is used (cf. [7]). 
On the other hand, for a truncated inverse gamma prior, a nearly optimal 
convergence rate is obtained from Theorem 1, while Gibbs sampling can 
be implemented easily with an additional acceptance-rejection step to take 
care of the truncation. 

The preceding theorem will be obtained by applying the general poste- 
rior convergence rate theorem in Section 4. Estimates of entropy and prior 
concentration rate obtained in [9] for the super-smooth case will be refined 
in a way suitable to the present situation in Section 5. 

The assumption that po is compactly supported is restrictive, in particular 
in combination with the assumption that $h(p_0,p_0*\phi_\sigma)=O(\sigma^2)$, which forces 
$p_0$ to tend to zero smoothly at the boundary points of its support. We do not 
know if the assumption of compact support can be completely removed, but 
we note the following extensions of the preceding theorem, which increase 
the applicability considerably. 

Given a smooth function $w:\mathbb{R}\to[0,1]$ with compact support, we can 
form a reduced data set $\tilde X_1,\ldots,\tilde X_{\tilde n}$ by rejecting each $X_i$ independently with 
probability $1-w(X_i)$, giving a sample from the density $\tilde p_0=p_0w/\int p_0w\,d\lambda$. 
The size $\tilde n$ is distributed binomially with parameters $n$ and $\int p_0w\,d\lambda$, whence 
$\tilde n/n\to\int p_0w\,d\lambda$ a.s. Conditionally on $\tilde n$, Theorem 2.1 of [8] can be applied 
to conclude that the posterior concentrates on Hellinger balls of radius $\varepsilon_{\tilde n}$ 
around $\tilde p_0$. If we choose $w$ to be equal to 1 on a given compact then $p_0$ and 
$\tilde p_0$ are proportional on this compact and hence the posterior essentially gives 
the (conditional) density of the original observations on this compact. 
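
The rejection step described above can be sketched as follows. The weight function `taper` is a hypothetical stand-in for $w$: it equals 1 on $[-1,1]$, vanishes outside $[-2,2]$, and uses a cosine ramp in between, which is only once continuously differentiable, so a genuinely smooth bump would be needed to match the text. All names and constants are assumptions of this example.

```python
import math
import random

def taper(x, inner=1.0, outer=2.0):
    """Illustrative weight w: 1 on [-inner, inner], 0 outside [-outer, outer], cosine ramp between."""
    r = abs(x)
    if r <= inner:
        return 1.0
    if r >= outer:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * (r - inner) / (outer - inner)))

def thin_sample(xs, w, rng):
    """Keep each X_i independently with probability w(X_i), i.e. reject with probability 1 - w(X_i)."""
    return [x for x in xs if rng.random() < w(x)]

rng = random.Random(1)
xs = [rng.gauss(0.0, 1.5) for _ in range(1000)]   # stand-in data, not from the paper
kept = thin_sample(xs, taper, rng)
print(len(kept), "of", len(xs), "observations retained")
```

Observations in the region where the weight equals 1 are always retained, which is why the reduced sample has a density proportional to $p_0$ there.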

This construction may be appropriate for Bayesian estimation of heavy- 
tailed densities, but it does change the posterior distribution. Even though 
we may expect that the change on an interval where w is identically one is 
minimal, it appears to be difficult to bound the difference. This difficulty 
can be avoided by applying the preceding with a sequence of truncation 
functions. For densities with exponentially decreasing tails, it yields a rate 
of convergence of the posterior relative to the Hellinger (semi)-distance on 
compact intervals given by $h_k^2(p,q)=\int_{-k}^{k}(p^{1/2}-q^{1/2})^2\,d\lambda$. For simplicity, in 
this result we assume that $G$ is compactly supported in $(0,\infty)$. The proof 
of the following theorem is contained in Section 10. 

Theorem 2. Suppose that $p_0$ satisfies $P_0[-a,a]^c\lesssim e^{-ca^{1/\gamma}}$ for some positive 
numbers $c$ and $\gamma$, and is twice continuously differentiable with $\int(p_0''/p_0)^2
p_0\,d\lambda<\infty$ and $\int(p_0'/p_0)^4p_0\,d\lambda<\infty$. If the base measure $\alpha$ has a continuous 
and positive density $\alpha'$ satisfying $\alpha'(t)\gtrsim e^{-d|t|^{1/\gamma}}$ for sufficiently large $|t|$, for 
some positive constant $d$, then the rate of convergence relative to the 
semi-distance $h_k$ is at least 

(2.3) $\varepsilon_n=\max\{(n\sigma_n)^{-1/2}(\log n)^{1+\gamma/2},\ \sigma_n^2\log n\}$. 
3. Sieve maximum likelihood and sieve priors. As a byproduct of the 
upper bounds on the entropy of the set of normal mixtures (necessary for 
the proofs of our main results), we can also obtain the rate of convergence 
of sieved MLEs for normal mixtures. We consider sieves of the types 

(3.1) $\mathcal{P}_n=\{p_{F,\sigma}: F[-a_n,a_n]=1,\ b_1\sigma_n\le\sigma\le b_2\sigma_n\}$, 

(3.2) $\mathcal{P}_n=\{p_{F,\sigma}: F[-a,a]^c\le A(a)\text{ for all }a>0,\ b_1\sigma_n\le\sigma\le b_2\sigma_n\}$. 

Here, $a_n$ and $\sigma_n$ are positive sequences and $A:(0,\infty)\to[0,1]$ is decreasing. 
We define the sieved MLE as $\hat p_n=\arg\max\{\prod_{i=1}^n p(X_i): p\in\mathcal{P}_n\}$. 

The rate of convergence of sieved MLEs relative to h can be obtained from 
Theorem 4 of [16] or Theorem 3.4.4 of [14]. There is a trade-off between 
the complexity of the model $\mathcal{P}_n$ and the distance of $\mathcal{P}_n$ to $p_0$. Under the 
assumption of Section 1.2, the approximation rate is $O(\sigma_n^2)$. The complexity 
of the model $\mathcal{P}_n$ can be measured through its bracketing entropy. The rate 
of convergence is the maximum of the approximation error and the solution 
$\varepsilon_n$ to the equation 

(3.3) $\int_0^{\varepsilon}\sqrt{\log N_{[\,]}(u,\mathcal{P}_n,h)}\,du\asymp\sqrt{n}\,\varepsilon^2$. 

Theorem 3. Let $\sigma_n\to0$ and $a_n\ge e$ so that $\log n\lesssim\log(a_n/\sigma_n)\lesssim\log n$, 
and let $\hat p_n$ be the sieved MLE relative to $\mathcal{P}_n$ given by (3.1). If $p_0$ has compact 
support and $[-a_n,a_n]\supset\mathrm{supp}(p_0)$ for all sufficiently large $n$, then $h(\hat p_n,p_0)=
O_P(\varepsilon_n)$ for 

(3.4) $\varepsilon_n=\max\{(n\sigma_n)^{-1/2}a_n\log n,\ \sigma_n^2\}$. 

The apparently best rate $n^{-2/5}(\log n)^{4/5}$ is obtained when $a_n$ is bounded, 
but $[-a_n,a_n]\supset\mathrm{supp}(p_0)$ and $\sigma_n\sim n^{-1/5}(\log n)^{2/5}$. The optimal order of 
bandwidth for the classical kernel estimator $\sigma_n\sim n^{-1/5}$ leads to a slightly 
larger error rate $n^{-2/5}\log n$. Admittedly, these are only upper bounds. In 
particular, the logarithmic factor may not be sharp. 

When $\mathrm{supp}(p_0)$ is not compact, but $p_0/(p_0*\phi_{\sigma_n})$ are uniformly bounded, 
we can use the sieves (3.2) to derive the convergence rate. The condition 
holds, for instance, if $p_0$ is increasing on $(-\infty,a]$, bounded below on $[a,b]$ 
and decreasing on $[b,\infty)$ for some $a<b$ (cf. Lemma 6 in Section 6). 

Theorem 4. Let $\sigma_n\to0$ so that $\log n\lesssim\log\sigma_n^{-1}\lesssim\log n$ and let $\hat p_n$ be 
the sieved MLE relative to $\mathcal{P}_n$ given by (3.2) with $A(a)=e^{-da^{1/\delta}}$, $d,\delta>0$ 
constants. If $P_0[-a,a]^c\le A(a)$ for every $a>0$, then $h(\hat p_n,p_0)=O_P(\varepsilon_n)$ for 

(3.5) $\varepsilon_n=\max\{(n\sigma_n)^{-1/2}(\log n)^{1+(1\vee2\delta)/4},\ \sigma_n^2\}$. 
The proofs of the theorems in this section are given in Section 11. 

The sieves $\mathcal{P}_n$ in (3.1) and (3.2) can also be 
used to construct a prior for which the posterior converges at the same 
rate $\varepsilon_n$ as obtained in Theorems 3 and 4. As in Theorem 3.1 of [8], take 
a minimal collection of Hellinger $\varepsilon_n$-brackets that cover $\mathcal{P}_n$. Consider the 
uniform prior $\Pi_n$ on the renormalized upper brackets. Then the resulting 
posterior converges at the rate $\varepsilon_n$. 

4. A general result on posterior convergence rates. When the prior $G$ on 
$\sigma/\sigma_n$ is not compactly supported, existing results on posterior convergence 
rates (such as Theorem 2.1 of [8]) do not seem to suffice in deriving the 
rate. Below we obtain a posterior convergence rate theorem where we use a 
countable decomposition of the space of densities together with conditions 
on their prior probabilities and entropy numbers with respect to h. 

Let $X_1,X_2,\ldots$ be i.i.d. with density $p\in\mathcal{P}$. Let $\Pi_n$ be a sequence of priors 
on $\mathcal{P}$ and let $p_0$ and $P_0$ stand for the true density and the true probability 
measure, respectively. Let $d$ be a metric which induces convex balls and is 
bounded above on $\mathcal{P}$ by a multiple of $h$. 

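Theorem 5. Suppose that $\mathcal{P}_n\subset\mathcal{P}$ is such that $\Pi_n(\mathcal{P}_n^c\mid X_1,\ldots,X_n)\to0$ 
in $P_0^n$-probability. Assume that $\mathcal{P}_n$ can be partitioned as $\bigcup_{j=-\infty}^{\infty}\mathcal{P}_{n,j}$ such 
that, for a sequence $\varepsilon_n\to0$ with $n\varepsilon_n^2\to\infty$, 

(4.1) $\sum_{j=-\infty}^{\infty}\sqrt{N(\varepsilon_n,\mathcal{P}_{n,j},d)}\,\sqrt{\Pi_n(\mathcal{P}_{n,j})}\,e^{-n\varepsilon_n^2}\to0$, 

(4.2) $\Pi_n(p: P_0\log(p_0/p)\le\varepsilon_n^2,\ P_0\log^2(p_0/p)\le\varepsilon_n^2)\ge e^{-n\varepsilon_n^2}$. 

Then $\Pi_n(p\in\mathcal{P}: d(p_0,p)>8\varepsilon_n\mid X_1,\ldots,X_n)\to0$ in $P_0^n$-probability. 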

Theorem 5 contains a standard posterior convergence theorem (cf. Theorem 
2.1 of [8]) as a special case where $\mathcal{P}_n$ is not decomposed (i.e., $\mathcal{P}_{n,0}=\mathcal{P}_n$ 
and $\mathcal{P}_{n,j}=\varnothing$ for $j\ne0$), so that $\log N(\varepsilon_n,\mathcal{P}_n,d)$ needs to be bounded by a 
small multiple of $n\varepsilon_n^2$ in order to satisfy (4.1). At the other extreme, if we 
decompose $\mathcal{P}_n$ sufficiently finely so that each $\mathcal{P}_{n,j}$ has diameter less than 
$\varepsilon_n$, then the covering numbers appearing in (4.1) are all 1 and hence (4.1) 
reduces to 

(4.3) $\sum_{j=-\infty}^{\infty}\sqrt{\Pi_n(\mathcal{P}_{n,j})}\,e^{-n\varepsilon_n^2}\to0$. 

The trade-off between entropy and summability of the square roots of prior 
probabilities is interesting and requires further investigation; see [15] for a 
consistency result based on the summability condition. 
To prove Theorem 5, we need two auxiliary results. Ghosal, Ghosh and 
van der Vaart [8] used the first of these with $\alpha=\beta=1$. 

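Lemma 1. For any convex set $\mathcal{Q}$ of probability measures with $\inf\{h(P_0,Q):
Q\in\mathcal{Q}\}\ge\varepsilon$, any $\alpha,\beta>0$ and all $n\ge1$, there exists a test $\phi_n=\phi_n(X_1,\ldots,X_n)$ 
such that 

$\sup_{Q\in\mathcal{Q}}\bigl(\alpha P_0^n\phi_n+\beta Q^n(1-\phi_n)\bigr)\le\sqrt{\alpha\beta}\,e^{-n\varepsilon^2/2}$. 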



Proof. The proof follows by a minor adaptation of a result in [11], 
pages 475-479, as in [10] . □ 

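Corollary 1. For any set of probability measures $\mathcal{Q}$ with $\inf\{d(P_0,Q):
Q\in\mathcal{Q}\}\ge4\varepsilon$, any $\alpha,\beta>0$ and all $n\ge1$, there exists a test $\phi_n$ such that 

$P_0^n\phi_n\le\sqrt{\beta/\alpha}\,N(\varepsilon,\mathcal{Q},d)\,\dfrac{e^{-n\varepsilon^2}}{1-e^{-n\varepsilon^2}},\qquad \sup_{Q\in\mathcal{Q}}Q^n(1-\phi_n)\le\sqrt{\alpha/\beta}\,e^{-n\varepsilon^2}$. 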

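Proof. For a given $j\in\mathbb{N}$, choose a maximal $j\varepsilon/2$-separated set of points 
in $S_j=\{Q\in\mathcal{Q}: j\varepsilon<d(Q,P_0)\le(j+1)\varepsilon\}$. This yields a set $S_j'$ such that the 
union of the balls of radius $j\varepsilon/2$ centered at these points covers $S_j$. Any such 
ball $B$ is convex by assumption and satisfies $h(Q,P_0)\ge d(Q,P_0)\ge j\varepsilon/2$ for 
all $Q\in B$. Because any given ball of radius $j\varepsilon/4$ can contain at most one 
point of $S_j'$, it follows that $\#S_j'\le N(\varepsilon j/4,S_j,d)\le N(\varepsilon,\mathcal{Q},d)$ for $j\ge4$. (If 
$S_j$ is empty, take $S_j'$ empty and adapt the following in the obvious way.) 

For every $P_1\in S_j'$, there exists a test $\omega_j$ with properties as in Lemma 1, 
with $\mathcal{Q}$ equal to the ball of radius $j\varepsilon/2$ centered at $P_1$. Let $\phi_n$ be the 
maximum of all tests attached in this way to some point $P_1\in S_j'$ for some $j\ge4$. 
Then 

$P_0^n\phi_n\le\sum_{j=4}^{\infty}\sum_{P_1\in S_j'}\sqrt{\beta/\alpha}\,e^{-nj^2\varepsilon^2/8}\le\sqrt{\beta/\alpha}\,N(\varepsilon,\mathcal{Q},d)\,\dfrac{e^{-2n\varepsilon^2}}{1-e^{-n\varepsilon^2}}$, 

$\sup_{Q\in\bigcup_{j\ge4}S_j}Q^n(1-\phi_n)\le\sup_{j\ge4}\sqrt{\alpha/\beta}\,e^{-nj^2\varepsilon^2/8}\le\sqrt{\alpha/\beta}\,e^{-2n\varepsilon^2}$. □ 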

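Proof of Theorem 5. Clearly, we may assume that the prior charges 
only $\mathcal{P}_n$. The event $A_n$ that $\int\prod_{i=1}^n(p(X_i)/p_0(X_i))\,d\Pi_n(p)\ge e^{-3n\varepsilon_n^2}$ satisfies 
$P_0^n(A_n)\to1$ by Lemma 8.1 of [8] and assumption (4.2). Now, for arbitrary 
tests $\phi_{n,j}$, we have 

$P_0^n\bigl[\Pi_n(P\in\mathcal{P}_{n,j}: d(P,P_0)>8\varepsilon_n\mid X_1,\ldots,X_n)1_{A_n}\bigr]$ 

$\le P_0^n\phi_{n,j}+P_0^n\Bigl((1-\phi_{n,j})\int_{\{P\in\mathcal{P}_{n,j}:\,d(P,P_0)>8\varepsilon_n\}}\prod_{i=1}^n\dfrac{p(X_i)}{p_0(X_i)}\,d\Pi_n(P)\Bigr)e^{3n\varepsilon_n^2}$ 

$\le P_0^n\phi_{n,j}+\sup_{P\in\mathcal{P}_{n,j}:\,d(P,P_0)>8\varepsilon_n}P^n(1-\phi_{n,j})\,\Pi_n(\mathcal{P}_{n,j})\,e^{3n\varepsilon_n^2}$, 

which can be bounded by a multiple of 

$\sqrt{\beta_j/\alpha_j}\,N(2\varepsilon_n,\mathcal{P}_{n,j},d)\,e^{-4n\varepsilon_n^2}+\sqrt{\alpha_j/\beta_j}\,e^{-4n\varepsilon_n^2}\,\Pi_n(\mathcal{P}_{n,j})\,e^{3n\varepsilon_n^2}$ 

for the choice of $\phi_{n,j}$ obtained from Corollary 1 with $\varepsilon=2\varepsilon_n$, $\mathcal{Q}=\{P\in
\mathcal{P}_{n,j}: d(P_0,P)>8\varepsilon_n\}$ and any $\alpha_j,\beta_j>0$. Put $\alpha_j=N(2\varepsilon_n,\mathcal{P}_{n,j},d)$, $\beta_j=
\Pi_n(\mathcal{P}_{n,j})$ and sum over $j$ to obtain the result in view of (4.1). □ 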



5. Entropy estimates. In this section we estimate the entropy of normal 
mixtures, paying special attention to components with small variances. The 
main idea is to approximate general normal mixtures by finite mixtures 
with a small number of components. This same device will also be used 
in Section 9 to estimate the prior probabilities of Kullback-Leibler type 
balls and is isolated in the following lemma. The lemma is based on the 
corresponding one for the super-smooth case (cf. Lemma 3.1 of [9]) and a 
partitioning argument. 



Lemma 2. Let $0<\varepsilon<1/2$ and $a,\sigma>0$ be given. For any probability 
measure $F$ on $[-a,a]$, there exists a discrete probability measure $F'$ on $[-a,a]$ 
with fewer than $D(a\sigma^{-1}\vee1)\log\varepsilon^{-1}$ support points, where $D$ is a universal 
constant, such that 

$\|p_{F,\sigma}-p_{F',\sigma}\|_\infty\lesssim\dfrac{\varepsilon}{\sigma},\qquad \|p_{F,\sigma}-p_{F',\sigma}\|_1\lesssim\varepsilon(\log\varepsilon^{-1})^{1/2}$. 

Proof. We can partition the interval $[-a,a]$ into $k=\lfloor2a/\sigma\rfloor$ disjoint, 
consecutive subintervals $I_1,\ldots,I_k$ of length $\sigma$ and a final interval $I_{k+1}$ of 
length $l_{k+1}$ smaller than $\sigma$. We may write $F=\sum_{i=1}^{k+1}F(I_i)F_i$, where each $F_i$ is 
a probability measure concentrated on $I_i$, and then $p_{F,\sigma}=\sum_{i=1}^{k+1}F(I_i)p_{F_i,\sigma}$. 
For ease of notation, let $Z_i$ be a random variable distributed according to $F_i$, 
and for $a_i$ the left endpoint of $I_i$, let $G_i$ be the law of $W_i=(Z_i-a_i)/\sigma$. Thus, 
$G_i$ is a probability measure on $[0,1]$ for $i=1,\ldots,k$ and on $[0,l_{k+1}/\sigma]\subset[0,1]$ 
for $i=k+1$. 

By Lemma 3.1 of [9], there exist probability measures $G_i'$ on $[0,1]$ with 
$N_i\lesssim\log\varepsilon^{-1}$ support points such that $\|p_{G_i,1}-p_{G_i',1}\|_\infty\lesssim\varepsilon$. For $i=k+1$, the 
measure $G_i'$ can be taken to be supported on $[0,l_{k+1}/\sigma]$. Lemma 3.3 from 
the same paper then shows that $\|p_{G_i,1}-p_{G_i',1}\|_1\lesssim\varepsilon(\log\varepsilon^{-1})^{1/2}$. Let $F_i'$ be 
the law of $a_i+\sigma W_i'$ if $W_i'$ has law $G_i'$ and set $F'=\sum_{i=1}^{k+1}F(I_i)F_i'$. Because 

$p_{F_i,\sigma}(x)=E\phi_\sigma(x-Z_i)=\sigma^{-1}E\phi((x-a_i)/\sigma-W_i)=\sigma^{-1}p_{G_i,1}((x-a_i)/\sigma)$, 

and similarly for $F_i'$ and $G_i'$, we have 

$\|p_{F_i,\sigma}-p_{F_i',\sigma}\|_\infty=\sigma^{-1}\|p_{G_i,1}-p_{G_i',1}\|_\infty$, 
$\|p_{F_i,\sigma}-p_{F_i',\sigma}\|_1=\|p_{G_i,1}-p_{G_i',1}\|_1$. 

Combined with $\|p_{F,\sigma}-p_{F',\sigma}\|\le\sum_{i=1}^{k+1}F(I_i)\|p_{F_i,\sigma}-p_{F_i',\sigma}\|$, this shows that 
$p_{F',\sigma}$ has the required distances to $p_{F,\sigma}$. The number of support points of 
$F'$ is bounded by the number of intervals $k+1$ times the maximum number 
of support points of an $F_i'$, and hence is bounded by a multiple of $(a\sigma^{-1}\vee
1)\log\varepsilon^{-1}$. □ 
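
The partitioning and rescaling step of this proof can be sketched for a discrete mixing distribution as follows. The code only performs the decomposition $F=\sum_i F(I_i)F_i$ and the change of variables $W_i=(Z_i-a_i)/\sigma$; it does not construct the reduced measures $G_i'$ of Lemma 3.1 of [9]. The function name, the tuple layout of the result and the requirement that all weights be positive are assumptions of this example.

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def partition_mixing(atoms, weights, a, sigma):
    """Split a discrete F on [-a, a] (positive weights) into cells of length sigma.

    Returns a list of (F(I_i), a_i, [(W, conditional weight), ...]) with
    a_i the left endpoint of cell I_i and W = (z - a_i) / sigma in [0, 1].
    """
    k = math.floor(2 * a / sigma)              # number of full cells of length sigma
    cells = {}
    for z, w in zip(atoms, weights):
        i = min(int((z + a) // sigma), k)      # index of the cell containing z
        cells.setdefault(i, []).append((z, w))
    pieces = []
    for i in sorted(cells):
        left = -a + i * sigma
        mass = sum(w for _, w in cells[i])
        rescaled = [((z - left) / sigma, w / mass) for z, w in cells[i]]
        pieces.append((mass, left, rescaled))
    return pieces
```

Each piece can then be handled on the unit interval with bandwidth 1, and the identity $p_{F_i,\sigma}(x)=\sigma^{-1}p_{G_i,1}((x-a_i)/\sigma)$ maps the result back to the original scale.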

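For given numbers $a,b_1,b_2$, let 
(5.1) $\mathcal{P}_{a,b_1,b_2}=\{p_{F,\sigma}: F[-a,a]=1,\ b_1\le\sigma\le b_2\}$. 

Lemma 3. Let $0<b_1<b_2$ and $a>0$ and define $\mathcal{P}_{a,b_1,b_2}$ by (5.1). Then 
for $0<\varepsilon<1/2$ and $d$ equal to the $L_1$-norm, 

$\log N(\varepsilon,\mathcal{P}_{a,b_1,b_2},d)\lesssim\log\Bigl(\dfrac{b_2}{b_1\varepsilon}\Bigr)+\Bigl(\dfrac{a}{b_1}+1\Bigr)\Bigl(\log\dfrac{1}{\varepsilon}\Bigr)\Bigl(\log\dfrac{1}{\varepsilon}+\log\Bigl(\dfrac{a}{b_1}+1\Bigr)\Bigr)$. 

For $d=\|\cdot\|_\infty$, the same bound holds with $\varepsilon$ replaced by $\varepsilon b_1$ if $\varepsilon b_1<1/2$. For 
$d=h$, the same bound holds with $\varepsilon$ replaced by $\varepsilon^2$. 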

Proof. Let $\alpha$, $\beta<1/2$, $\gamma$ and $\delta$ be given positive numbers. Fix a minimal 
$\alpha$-net $\Sigma$ over the interval $[b_1,b_2]$ and let $\mathcal{F}$ be the set of discrete 
probability distributions on $[-a,a]$ with at most $N\le D(ab_1^{-1}\vee1)\log\beta^{-1}$ support 
points, for the constant $D$ of Lemma 2. For every $\sigma\in[b_1,b_2]$, there exists 
$\sigma'\in\Sigma$ with $|\sigma-\sigma'|<\alpha$, whence 

$\|p_{F,\sigma}-p_{F,\sigma'}\|_\infty\le\|\phi_\sigma-\phi_{\sigma'}\|_\infty\lesssim|\sigma-\sigma'|/(\sigma\wedge\sigma')^2\le\alpha/b_1^2$, 

$\|p_{F,\sigma}-p_{F,\sigma'}\|_1\le\|\phi_\sigma-\phi_{\sigma'}\|_1\lesssim|\sigma-\sigma'|/(\sigma\wedge\sigma')\le\alpha/b_1$. 

By Lemma 2, for a sufficiently large $D$, there exists, for every given probability 
measure $F$ on $[-a,a]$, an element $F'\in\mathcal{F}$ (possibly depending on $\sigma'$) such 
that $\|p_{F,\sigma'}-p_{F',\sigma'}\|_\infty\lesssim\beta/b_1$ and $\|p_{F,\sigma'}-p_{F',\sigma'}\|_1\lesssim\beta(\log\beta^{-1})^{1/2}$. Hence, 

$\|p_{F,\sigma}-p_{F',\sigma'}\|_\infty\lesssim\alpha/b_1^2+\beta/b_1$, 

$\|p_{F,\sigma}-p_{F',\sigma'}\|_1\lesssim\alpha/b_1+\beta(\log\beta^{-1})^{1/2}$. 

Thus, $\mathcal{P}=\{p_{F,\sigma}: (F,\sigma)\in\mathcal{F}\times\Sigma\}$ is an $\varepsilon$-net over $\mathcal{P}_{a,b_1,b_2}$ for $\|\cdot\|_\infty$ and $\|\cdot\|_1$, 
respectively, if the expressions above are made to be less than $\varepsilon$. 

We next construct a finite net over $\mathcal{P}$ by restricting the support points and 
weights of $F$ to suitable grids. For a fixed $\gamma$-net $S$ over the $N$-dimensional 
simplex for the $\ell_1$-norm, let $\mathcal{P}'$ be the set of all $p_{F,\sigma}\in\mathcal{P}$ such that the $N$ 
support points of $F$ are among the points $\{0,\pm\delta,\pm2\delta,\ldots\}\cap[-a,a]$ with 
weights belonging to $S$. We may project $p_{F,\sigma}\in\mathcal{P}$ into $p_{F',\sigma}\in\mathcal{P}'$ by first 
moving the point masses of $F$ to a closest point in the grid $\{0,\pm\delta,\pm2\delta,\ldots\}$ 
and then changing the vector of sizes of the point masses to a closest vector 
in $S$. Let $z_1,z_2,\ldots,z_1',z_2',\ldots,p_1,p_2,\ldots,p_1',p_2',\ldots$ be such that $F=\sum_jp_j\delta_{z_j}$ 
and $F'=\sum_jp_j'\delta_{z_j'}$. Then 

$|p_{F,\sigma}(x)-p_{F',\sigma}(x)|\le\sum_j\{p_j|\phi_\sigma(x-z_j)-\phi_\sigma(x-z_j')|+|p_j-p_j'|\phi_\sigma(x-z_j')\}$. 

Thus, $\|p_{F,\sigma}-p_{F',\sigma}\|_\infty\lesssim\delta\|\phi_\sigma'\|_\infty+\gamma\|\phi_\sigma\|_\infty\lesssim\delta/b_1^2+\gamma/b_1$ and $\|p_{F,\sigma}-p_{F',\sigma}\|_1\lesssim
\delta/b_1+\gamma$. Hence, $\mathcal{P}'$ is a $c\eta$-net over $\mathcal{P}_{a,b_1,b_2}$ for $c$ a universal constant and 
for $\eta=\eta_\infty=\alpha/b_1^2+\beta/b_1+\delta/b_1^2+\gamma/b_1$ for $\|\cdot\|_\infty$ and $\eta=\eta_1=\alpha/b_1+
\beta(\log\beta^{-1})^{1/2}+\delta/b_1+\gamma$ for $\|\cdot\|_1$. Because $h^2(f,g)\le\|f-g\|_1$ for any two 
densities, $\mathcal{P}'$ is a Hellinger $c^{1/2}\eta^{1/2}$-net over $\mathcal{P}_{a,b_1,b_2}$ for $\eta=\eta_1$. 

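There are at most $((b_2-b_1)/\alpha)\vee1$ possible choices of $\sigma\in\Sigma$. Each $z_j$ can 
assume at most $2a/\delta+1$ different values, $j=1,\ldots,N$. The cardinality of a 
minimal $\gamma$-net $S$ over the $N$-dimensional unit simplex is bounded by $(5/\gamma)^N$ 
for $\gamma<1$ (cf. Lemma A.4 of [9]). Therefore, 

$\#\mathcal{P}'\le\Bigl(\dfrac{b_2-b_1}{\alpha}\vee1\Bigr)\times\Bigl(\dfrac{2a}{\delta}+1\Bigr)^N\times\Bigl(\dfrac{5}{\gamma\wedge1}\Bigr)^N$. 

Because $N\le D(ab_1^{-1}\vee1)\log\beta^{-1}$, by construction, it follows that 

$\log(\#\mathcal{P}')\le\log\Bigl(\dfrac{b_2-b_1}{\alpha}\vee1\Bigr)+D\Bigl(\dfrac{a}{b_1}\vee1\Bigr)\log\dfrac{1}{\beta}\Bigl[\log\Bigl(\dfrac{2a}{\delta}+1\Bigr)+\log\dfrac{5}{\gamma\wedge1}\Bigr]$. 

This number, with $\alpha=\delta=b_1\varepsilon$, $\beta=\varepsilon(\log\varepsilon^{-1})^{-1/2}$ and $\gamma=\varepsilon$ for given $\varepsilon<
1/2$, is a bound on the $D'\varepsilon$-entropy for $\|\cdot\|_1$ for a universal constant $D'$, 
and with $\alpha=\delta=b_1^2\varepsilon$, $\beta=b_1\varepsilon$ and $\gamma=b_1\varepsilon$, a bound on the $D'\varepsilon$-entropy for 
$\|\cdot\|_\infty$ is obtained upon simplification. (If $D'>1$, we replace $\varepsilon$ by $\varepsilon/D'$.) □ 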

For positive numbers $a,\tau,b_1<b_2$, let 
(5.2) $\mathcal{P}_{a,\tau}=\{p_{F,\sigma}: F[-a,a]=1,\ b_1\tau\le\sigma\le b_2\tau\}$, 

where $\tau$ is small. 
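Theorem 6. Let $b_1<b_2$, $\tau<1/4$ and $a>e$ be given positive numbers 
and define $\mathcal{P}_{a,\tau}$ by (5.2). Then, for $0<\varepsilon<1/2$ and $d$ the $L_1$-norm or 
Hellinger distance, 

(5.3) $\log N(\varepsilon,\mathcal{P}_{a,\tau},d)\lesssim\dfrac{a}{\tau}\Bigl(\log\dfrac{1}{\varepsilon}\Bigr)\Bigl(\log\dfrac{a}{\varepsilon\tau}\Bigr)$, 

where the constant in "$\lesssim$" depends on $b_1,b_2$ only. For $d=\|\cdot\|_\infty$, (5.3) holds 
with $\log\varepsilon^{-1}$ replaced by $\log(\varepsilon\tau)^{-1}$. Further, for any of the three metrics, 

(5.4) $\log N_{[\,]}(\varepsilon,\mathcal{P}_{a,\tau},d)\lesssim\dfrac{a}{\tau}\Bigl(\log\dfrac{a}{\varepsilon\tau}\Bigr)^2$. 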

Proof. Inequality (5.3) can be deduced from Lemma 3 by replacing $b_1$ 
and $b_2$ by $b_1\tau$ and $b_2\tau$ and then simplifying the resulting entropy bounds. 

To obtain the bound on the bracketing numbers of $\mathcal{P}_{a,\tau}$, we first note 
that $H(x)=(b_1\tau)^{-1}\phi(x/(2b_2\tau))1\{|x|>2a\}+(b_1\tau)^{-1}\phi(0)1\{|x|\le2a\}$ is an 
envelope for $\mathcal{P}_{a,\tau}$. Given an $\eta$-net $f_1,\ldots,f_M$ for $\|\cdot\|_\infty$, the brackets $[l_i,u_i]$, 
where $l_i=(f_i-\eta)\vee0$ and $u_i=(f_i+\eta)\wedge H$, cover $\mathcal{P}_{a,\tau}$. Thus, $u_i-l_i\le
(2\eta)\wedge H$ and the size of these brackets in $L_1$ can be bounded by 

$\int(u_i-l_i)\,d\lambda\le\|u_i-l_i\|_\infty B+\int_{|x|>B}H(x)\,dx\lesssim\eta B+\phi(B/2b_2\tau)$, 

for any $B>2a$, by the tail bound for the normal distribution. For $B=
(b_2\vee1)2a(\log\eta^{-1})^{1/2}$, we obtain the upper bound equal to a multiple of 
$\eta(\log\eta^{-1})^{1/2}a+\eta^{a^2}\lesssim\eta(\log\eta^{-1})^{1/2}a$. Thus, there exists a constant $D$ 
(possibly depending on $b_1,b_2$) such that the $D\eta(\log\eta^{-1})^{1/2}a$-bracketing number 
for the $L_1$-norm is bounded by the uniform $\eta$-covering number obtained 
previously. Choose $\eta=D\varepsilon a^{-2}(\log\varepsilon^{-1})^{-1/2}$ for an appropriate constant $D$ and 
simplify to obtain (5.4). □ 

The main difference between the bound given by Theorem 6 and the 
bound when the scale is bounded away from zero (cf. [9]) is the presence 
of the factor $a\tau^{-1}$. This factor is the main driving force for the slower rate 
of convergence of the posterior in the present situation compared to the 
super-smooth case. 

The set of mixtures with an arbitrary mixing distribution on R is not 
totally bounded and hence Theorem 6 can only be extended to mixing dis- 
tributions with possibly noncompact support if the mixing distribution is 
restricted in some other way. We shall extend the theorem to mixtures with 
mixing distributions whose tails are bounded by a given function (such as 
the normal density). 

For a given decreasing function $A:(0,\infty)\to[0,1]$ with inverse $A^{-1}$ and 
positive numbers $\tau$ and $b_1<b_2$, we consider the class of densities 

(5.5) $\mathcal{P}_{A,\tau}=\{p_{F,\sigma}: F[-a,a]^c\le A(a)\text{ for all }a,\ b_1\tau\le\sigma\le b_2\tau\}$. 
For the standard normal distribution function $\Phi$, let $\bar A$ be the function 
defined by $\bar A(a)=A(a)+\int_{a/2}^{\infty}A\,d\lambda+1-\Phi(a)$. 

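Theorem 7. Let $b_1<b_2$ and $\tau<1/4$ be given positive numbers, let 
$A:(0,\infty)\to[0,1]$ be a decreasing function and define $\mathcal{P}_{A,\tau}$ as in (5.5). Then, 
for $0<\varepsilon<\min(1/4,A(e))$, we have that 

$\log N(3\varepsilon,\mathcal{P}_{A,\tau},\|\cdot\|_1)\lesssim\dfrac{A^{-1}(\varepsilon)}{\tau}\Bigl(\log\dfrac{1}{\varepsilon}\Bigr)\Bigl(\log\dfrac{A^{-1}(\varepsilon)}{\varepsilon\tau}\Bigr)$, 

where the constant in "$\lesssim$" depends on $b_1,b_2$ only. Furthermore, for a 
constant $c$ depending on $b_1,b_2$ only, 

$\log N_{[\,]}(c\varepsilon,\mathcal{P}_{A,\tau},\|\cdot\|_1)\lesssim\dfrac{\bar A^{-1}(\varepsilon\tau)}{\tau}\Bigl(\log\dfrac{\bar A^{-1}(\varepsilon\tau)}{\varepsilon\tau}\Bigr)^2$. 

For the entropy relative to $h$, the same bounds hold with $\varepsilon$ replaced by $\varepsilon^2$. 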

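Proof. Because $a_\varepsilon=A^{-1}(\varepsilon)$ satisfies $F[-a_\varepsilon,a_\varepsilon]^c\le\varepsilon$ for every $F$ as in 
the definition of $\mathcal{P}_{A,\tau}$, Lemma A.3 of [9] shows that the $L_1$-distance between 
$\mathcal{P}_{A,\tau}$ and $\mathcal{P}_{a_\varepsilon,\tau}$ is bounded above by $2\varepsilon$. It follows that an $\varepsilon$-net over $\mathcal{P}_{a_\varepsilon,\tau}$ 
is a $3\varepsilon$-net over $\mathcal{P}_{A,\tau}$. In view of Theorem 6, this implies the bound on the 
entropy without bracketing given in the first inequality of the theorem. 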

To bound the bracketing numbers, we obtain by partial integration, for 
any $x>a>0$, 

$\int_a^{\infty}\phi_\sigma(x-z)\,dF(z)=(1-F)(a)\phi_\sigma(x-a)+\int_a^{\infty}\phi_\sigma'(z-x)(1-F)(z)\,dz$ 

$\lesssim A(a)\phi_\sigma(x-a)+\dfrac{\phi(x)}{\sigma}+\dfrac{A(x/2)}{\sigma}$. 

For $x<-a<0$, the same bound is valid for $\int_{-\infty}^{-a}\phi_\sigma(x-z)\,dF(z)$, but with 
$\phi_\sigma(x-a)$ replaced by $\phi_\sigma(x+a)$ and with $x$ replaced by $-x$. Also, for $a>0$ 
and $x<a$, 

$\int_a^{\infty}\phi_\sigma(x-z)\,dF(z)\le\phi_\sigma(x-a)F[a,\infty)\le\phi_\sigma(x-a)A(a)$. 

For $x>-a$, the same bound is valid for $\int_{-\infty}^{-a}\phi_\sigma(x-z)\,dF(z)$, but with 
$\phi_\sigma(x-a)$ replaced by $\phi_\sigma(x+a)$. Hence, for any $x$ and $a>0$, 

$\int_{|z|>a}\phi_\sigma(x-z)\,dF(z)\lesssim(\phi_{b_2\tau}(x-a)+\phi_{b_2\tau}(x+a))A(a)+\dfrac{1}{\tau}(\phi(x)+A(x/2))1\{|x|>a\}$. 
T 
Let $H$ denote the appropriate multiple of the right-hand side of this 
inequality. If $F_a$ is the renormalized restriction to $[-a,a]$ of a probability 
measure $F$, then $F[-a,a]p_{F_a,\sigma}\le p_{F,\sigma}\le F[-a,a]p_{F_a,\sigma}+H$. Consequently, given 
$\varepsilon$-brackets $[l_i,u_i]$ that cover $\mathcal{P}_{a,\tau}$, there exists for every $(F,\sigma)$, as in the 
definition of $\mathcal{P}_{A,\tau}$, a bracket $[l_i,u_i]$ with $l_i(1-A(a))\le p_{F,\sigma}\le u_iF[-a,a]+H\le
u_i+H$. Thus, the brackets $[l_i(1-A(a)),u_i+H]$ cover $\mathcal{P}_{A,\tau}$. The size in $L_1$ of 
a bracket $[l_i(1-A(a)),u_i+H]$ is bounded by $\|u_i-l_i\|_1+\|l_i\|_1A(a)+\|H\|_1\lesssim\varepsilon+\bar A(a)/\tau$. 
We now choose $a$ such that $\bar A(a)\le\tau\varepsilon$ and apply Theorem 6 to bound the 
number of brackets $[l_i,u_i]$. □ 

As an example, if the mixing distributions have sub-Gaussian tails, then 
we can apply the preceding theorem with $A$ equal to $1-\Phi(a)\le\phi(a)$, whence 
$\bar A$ is bounded by a multiple of the same function. Then, both $A^{-1}(\varepsilon)$ and 
$\bar A^{-1}(\varepsilon)$ are bounded by a multiple of $(\log\varepsilon^{-1})^{1/2}$ and the (bracketing) entropy 
is bounded by a multiple of $\tau^{-1}(\log(\varepsilon\tau)^{-1})^{5/2}$. Provided the tails of the 
mixing distributions are bounded by a function of the form A(a) = e~ da , 
the entropy of the set of mixtures increases at most through a power of 
log(eT)" 1 . On the other hand, polynomially decreasing tails incur an addi- 
tional factor of e~ k in the entropy bounds of Theorem 7. 

6. Approximation results. If $p_0$ is a twice differentiable density, then for $d$ equal to the $L_p$-distance, it is well known that $d(p_0, p_0 * \phi_\sigma) = O(\sigma^2)$. In the following lemma, we establish this for $d = h$.

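Lemma 4. Let $p_0$ be a twice continuously differentiable probability density.

• If $\int (p_0''/p_0)^2 p_0\,d\lambda < \infty$ and $\int (p_0'/p_0)^4 p_0\,d\lambda < \infty$, then $h(p_0, p_0 * \phi_\sigma) \lesssim \sigma^2$.

• If $p_0$ is bounded with $\int |p_0''|\,d\lambda < \infty$, then $\|p_0 - p_0 * \phi_\sigma\|_1 \lesssim \sigma^2$.

In both cases, the constants in "$\lesssim$" depend on the given integrals only.

Proof. By the assumption of $P_0$-integrability of the functions $p_0'/p_0$ and $p_0''/p_0$, we have $\int |p_0^{(i)}|\,d\lambda < \infty$ for $i = 1, 2$. Therefore, $p_0$ and $p_0'$ are uniformly bounded, from which it can be seen that $p_\sigma(x) = \int p_0(x - \sigma y)\phi(y)\,dy$ is twice partially differentiable relative to $\sigma$, with derivatives $\dot{p}_\sigma(x)$ and $\ddot{p}_\sigma(x)$ given by

$$\dot{p}_\sigma(x) = -\int y\,p_0'(x - \sigma y)\,\phi(y)\,dy, \qquad \ddot{p}_\sigma(x) = \int y^2\,p_0''(x - \sigma y)\,\phi(y)\,dy.$$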
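Using Taylor's theorem with the integral form of the remainder (cf. [2], page 120), we have

$$p_\sigma^{1/2}(x) = p_0^{1/2}(x) + \sigma\,\frac{\dot{p}_0(x)}{2p_0^{1/2}(x)} + \sigma^2 \int_0^1 \Bigl(\frac{\ddot{p}_{s\sigma}(x)}{2p_{s\sigma}^{1/2}(x)} - \frac{\dot{p}_{s\sigma}^2(x)}{4p_{s\sigma}^{3/2}(x)}\Bigr)(1 - s)\,ds.$$

Because $\dot{p}_0(x) = -\int p_0'(x)\,y\,\phi(y)\,dy = 0$ for every $x$, we obtain

$$h^2(p_0, p_\sigma) \le \sigma^4 \int\!\!\int_0^1 \Bigl(\frac{\ddot{p}_{s\sigma}(x)}{2p_{s\sigma}^{1/2}(x)} - \frac{\dot{p}_{s\sigma}^2(x)}{4p_{s\sigma}^{3/2}(x)}\Bigr)^2 dx\,(1 - s)^2\,ds \lesssim \sigma^4 \int_0^1\!\!\int \Bigl(\frac{\ddot{p}_{s\sigma}^2(x)}{p_{s\sigma}(x)} + \frac{\dot{p}_{s\sigma}^4(x)}{p_{s\sigma}^3(x)}\Bigr)\,dx\,(1 - s)^2\,ds.$$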
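Now, for any $\sigma$, by the Cauchy-Schwarz inequality,

$$\ddot{p}_\sigma^2(x) = \Bigl(\int \frac{p_0''(x - \sigma y)}{p_0^{1/2}(x - \sigma y)}\,y^2\,p_0^{1/2}(x - \sigma y)\,\phi(y)\,dy\Bigr)^2 \le \int \frac{(p_0''(x - \sigma y))^2}{p_0(x - \sigma y)}\,y^4\,\phi(y)\,dy \times p_\sigma(x).$$

Furthermore, by Hölder's inequality with $p = 4$ and $q = 4/3$, we have

$$\dot{p}_\sigma^4(x) = \Bigl(\int \frac{p_0'(x - \sigma y)}{p_0^{3/4}(x - \sigma y)}\,y\,p_0^{3/4}(x - \sigma y)\,\phi(y)\,dy\Bigr)^4 \le \int \frac{(p_0'(x - \sigma y))^4}{p_0^3(x - \sigma y)}\,y^4\,\phi(y)\,dy \times p_\sigma^3(x).$$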

The required bound for the proof of the first assertion now follows by sub- 
stituting these inequalities into the expression for the Hellinger distance and 
interchanging the order of integration. 

The proof of the second assertion is similar, but easier. □ 
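As a quick sanity check of the $\sigma^2$ rate in Lemma 4, the following sketch evaluates the Hellinger distance in closed form for one convenient example that is not taken from the paper: $p_0$ standard normal, so that $p_0 * \phi_\sigma$ is the $N(0, 1 + \sigma^2)$ density. The function names are illustrative, and the convention $h^2(p,q) = \int (\sqrt{p} - \sqrt{q})^2\,d\lambda$ is assumed.

```python
import math

def hellinger_normal(s1, s2):
    """Hellinger distance between N(0, s1^2) and N(0, s2^2),
    with h^2 = int (sqrt(p) - sqrt(q))^2 (assumed convention)."""
    affinity = math.sqrt(2.0 * s1 * s2 / (s1 * s1 + s2 * s2))
    return math.sqrt(max(0.0, 2.0 - 2.0 * affinity))

def ratio(sigma):
    # p0 = N(0, 1), so p0 * phi_sigma = N(0, 1 + sigma^2)
    return hellinger_normal(1.0, math.sqrt(1.0 + sigma * sigma)) / sigma ** 2

sigmas = (0.2, 0.1, 0.05, 0.025)
ratios = [ratio(s) for s in sigmas]
for s, r in zip(sigmas, ratios):
    print(f"sigma={s:<6} h/sigma^2={r:.4f}")
```

In this example the ratio $h/\sigma^2$ stays bounded as $\sigma$ decreases, which is what the first assertion of the lemma predicts.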



The following lemma bounds the distance between a normal mixture 
with a mixing distribution F and that with a discrete approximation to F. 
This result, which extends Lemma 5.1 of [9], will be instrumental in lower- 
bounding the prior probability of Kullback-Leibler-type neighborhoods. 

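Lemma 5. Let $\mathbb{R} = \bigcup_{j=0}^N U_j$ be a partition of $\mathbb{R}$ and $F' = \sum_{j=1}^N p_j\delta_{z_j}$ be a probability measure with $z_j \in U_j$ for $j = 1, \ldots, N$. Then, for any probability measure $F$ on $\mathbb{R}$,

$$\|p_{F,\sigma} - p_{F',\sigma}\|_\infty \lesssim \frac{1}{\sigma^2} \max_{1 \le j \le N} \lambda(U_j) + \frac{1}{\sigma} \sum_{j=1}^N |F(U_j) - p_j|,$$

$$\|p_{F,\sigma} - p_{F',\sigma}\|_1 \lesssim \frac{1}{\sigma} \max_{1 \le j \le N} \lambda(U_j) + \sum_{j=1}^N |F(U_j) - p_j|.$$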

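Proof. Bound $p_{F,\sigma}(x) - p_{F',\sigma}(x)$ by

$$\int_{U_0} \phi_\sigma(x - z)\,dF(z) + \sum_{j=1}^N \int_{U_j} \bigl(\phi_\sigma(x - z) - \phi_\sigma(x - z_j)\bigr)\,dF(z) + \sum_{j=1}^N \phi_\sigma(x - z_j)\bigl(F(U_j) - p_j\bigr).$$

The result now follows because $F(U_0) = 1 - \sum_{j=1}^N F(U_j) \le \sum_{j=1}^N |F(U_j) - p_j|$, $\|\phi_\sigma\|_\infty \lesssim \sigma^{-1}$, $\|\phi_\sigma'\|_\infty \lesssim \sigma^{-2}$ and $\|\phi_\sigma(\cdot - z) - \phi_\sigma(\cdot - z')\|_1 \lesssim \sigma^{-1}|z - z'|$. □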
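The $L_1$ bound of Lemma 5 can be illustrated numerically. The sketch below uses an example that is an assumption of this illustration, not of the paper: $F$ uniform on $[0,1]$, cells $U_j$ of equal length $1/N$, atoms $z_j$ at the cell midpoints and $p_j = F(U_j)$, so that the second term of the bound vanishes. The explicit constant $\sqrt{2/\pi}$ comes from $\|\phi_\sigma(\cdot - z) - \phi_\sigma(\cdot - z')\|_1 \le \sqrt{2/\pi}\,|z - z'|/\sigma$; the integral is approximated by a midpoint Riemann sum.

```python
import math

def phi(x, sigma):
    """Normal density with mean 0 and standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def Phi(x):
    """Standard normal distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def l1_distance(n_bins, sigma, grid=3000):
    """Approximate L1 distance between the Uniform[0,1] mixture and
    the mixture over n_bins midpoint atoms with equal weights."""
    lo, hi = -1.0, 2.0            # tails beyond this range are negligible for small sigma
    step = (hi - lo) / grid
    atoms = [(j + 0.5) / n_bins for j in range(n_bins)]
    total = 0.0
    for i in range(grid):
        x = lo + (i + 0.5) * step
        p_cont = Phi(x / sigma) - Phi((x - 1.0) / sigma)
        p_disc = sum(phi(x - z, sigma) for z in atoms) / n_bins
        total += abs(p_cont - p_disc) * step
    return total

sigma = 0.1
results = {}
for n_bins in (5, 10, 20):
    bound = math.sqrt(2.0 / math.pi) * (1.0 / n_bins) / sigma
    results[n_bins] = (l1_distance(n_bins, sigma), bound)
    print(n_bins, results[n_bins])
```

The computed distances sit well below the bound in this example, since midpoint atoms approximate a smooth mixing distribution much better than the worst case covered by the lemma.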

Bounds on the Kullback-Leibler divergence require some control on quotients of the type $p_0/p_{F,\sigma}$. The following lemma, which is implicit in Remark 3 of [7], is useful for this purpose.

Lemma 6. Let $p$ be a bounded probability density such that $p$ is nondecreasing on $(-\infty, a]$, bounded away from 0 on $[a, b]$ and nonincreasing on $[b, \infty)$ for some $a < b$. Then, for every $\tau > 0$, there exists a constant $C > 0$ such that $p * \phi_\sigma \ge Cp$ for every $\sigma < \tau$.

The next three lemmas are useful to control the Kullback-Leibler diver- 
gence and similar quantities in terms of the Hellinger distance. The first 
lemma is a simplification of Theorem 5 of [16]. 

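Lemma 7. For every $b > 0$, there exists a constant $\varepsilon_b > 0$ such that for all probability measures $P$ and $Q$ with $0 < h^2(p,q) < \varepsilon_b P(p/q)^b$, with $\log_+ x = \log x \vee 0$,

$$P\log\frac{p}{q} \lesssim h^2(p,q)\Bigl\{1 + \frac{1}{b}\log_+\frac{1}{h(p,q)} + \frac{1}{b}\log_+ P\Bigl(\frac{p}{q}\Bigr)^b\Bigr\},$$

$$P\Bigl(\log\frac{p}{q}\Bigr)^2 \lesssim h^2(p,q)\Bigl\{1 + \frac{1}{b}\log_+\frac{1}{h(p,q)} + \frac{1}{b}\log_+ P\Bigl(\frac{p}{q}\Bigr)^b\Bigr\}^2.$$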

Proof. The function $r : (0, \infty) \to \mathbb{R}$ defined implicitly by $\log x = 2(\sqrt{x} - 1) - r(x)(\sqrt{x} - 1)^2$ possesses the following properties:

(i) $r$ is nonnegative and decreasing.

(ii) $r(x) \sim \log x^{-1}$ as $x \downarrow 0$, whence there exists $\varepsilon' > 0$ such that $r(x) \le 2\log x^{-1}$ on $[0, \varepsilon']$ (a computer graph indicates that $\varepsilon' = 0.4$ will suffice).

(iii) For every $b > 0$, there exists $\varepsilon'' > 0$ such that $x^b r(x)$ is increasing on $[0, \varepsilon'']$. (For $b \ge 1$, we may take $\varepsilon'' = 1$, but for $b$ close to zero, $\varepsilon''$ must be very small.)
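Properties (i) and (ii) can be checked on a grid, in the spirit of the "computer graph" mentioned in (ii). The sketch below solves the defining relation for $r$ explicitly; the grid and its range are arbitrary choices made for this illustration, and a grid check is of course not a proof.

```python
import math

def r(x):
    """r(x) solving log x = 2(sqrt(x) - 1) - r(x) (sqrt(x) - 1)^2, for x > 0, x != 1."""
    t = math.sqrt(x) - 1.0
    return (2.0 * t - math.log(x)) / (t * t)

# Grid 0.005, 0.015, ..., 2.995; it never hits x = 1, where r is a 0/0 limit.
grid = [0.005 + 0.01 * k for k in range(300)]
values = [r(x) for x in grid]

nonnegative = all(v >= 0.0 for v in values)
decreasing = all(a > b for a, b in zip(values, values[1:]))
bound_holds = all(r(x) <= 2.0 * math.log(1.0 / x) for x in grid + [0.4] if x <= 0.4)

print(nonnegative, decreasing, bound_holds)
```

On this grid $r$ is positive and decreasing, and $r(x) \le 2\log x^{-1}$ holds up to $x = 0.4$, consistent with the value of $\varepsilon'$ suggested in (ii).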

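Using these properties and $h^2(p,q) = -2P(\sqrt{q/p} - 1)$, we obtain

$$P\log\frac{p}{q} = h^2(p,q) + P\,r\Bigl(\frac{q}{p}\Bigr)\Bigl(\sqrt{\frac{q}{p}} - 1\Bigr)^2$$

$$\le h^2(p,q) + r(\varepsilon)h^2(p,q) + P\,r\Bigl(\frac{q}{p}\Bigr)1\Bigl\{\frac{q}{p} < \varepsilon\Bigr\}$$

$$\le h^2(p,q) + 2\log\frac{1}{\varepsilon}\,h^2(p,q) + 2\varepsilon^b\log\frac{1}{\varepsilon}\,P\Bigl(\frac{p}{q}\Bigr)^b$$

for $\varepsilon \le \varepsilon' \wedge \varepsilon'' \wedge 4$. The proof of the first inequality now follows by choosing $\varepsilon^b = h^2(p,q)/P(p/q)^b$ and $\varepsilon_b = (\varepsilon' \wedge \varepsilon'' \wedge 4)^b$.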

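To prove the second inequality, note that $|\log x| \le 2|\sqrt{x} - 1|$, $x \ge 1$, and so

$$P\Bigl(\log\frac{p}{q}\Bigr)^2 1\Bigl\{\frac{q}{p} \ge 1\Bigr\} \le 4P\Bigl(\sqrt{\frac{q}{p}} - 1\Bigr)^2 \le 4h^2(p,q).$$

Next, for $\varepsilon \le \varepsilon''_{b/2}$ (the constant of property (iii) with $b/2$ in place of $b$), in view of the third property of $r$ we have

$$P\Bigl(\log\frac{p}{q}\Bigr)^2 1\Bigl\{\frac{q}{p} \le 1\Bigr\} \le 8P\Bigl(\sqrt{\frac{q}{p}} - 1\Bigr)^2 + 2P\,r^2\Bigl(\frac{q}{p}\Bigr)\Bigl(\sqrt{\frac{q}{p}} - 1\Bigr)^4 1\Bigl\{\frac{q}{p} \le 1\Bigr\}$$

$$\le 8h^2(p,q) + 2r^2(\varepsilon)h^2(p,q) + 2\varepsilon^b r^2(\varepsilon)P\Bigl(\frac{p}{q}\Bigr)^b.$$

With $\varepsilon^b = h^2(p,q)/P(p/q)^b$ and $\varepsilon_b \le (\varepsilon' \wedge \varepsilon''_{b/2})^b$, the proof follows from (ii). □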



The next lemma is the limiting case of the preceding lemma as $b \uparrow \infty$. The first assertion was proved by Birgé and Massart ([1], equation (7.6)). The second assertion improves on Lemma 8.3 of [8].



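Lemma 8. For every pair of probability densities $p$ and $q$,

$$P\log\frac{p}{q} \lesssim h^2(p,q)\Bigl(1 + \log\Bigl\|\frac{p}{q}\Bigr\|_\infty\Bigr),$$

$$P\Bigl(\log\frac{p}{q}\Bigr)^2 \lesssim h^2(p,q)\Bigl(1 + \log\Bigl\|\frac{p}{q}\Bigr\|_\infty\Bigr)^2.$$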
Proof. It can be checked that, for $b \ge 1$, we can choose $\varepsilon'' = 1$ in the preceding proof. Furthermore, we can choose $\varepsilon' = 1$ if we use the bound $r(x) \le 2 + 2\log x^{-1}$ rather than the bound $r(x) \le 2\log x^{-1}$. This leads to the same types of bound as in Lemma 7, which are then seen to be valid for every $b \ge 2$ and any probability densities $p$ and $q$ with $h^2(p,q) \le P(p/q)^b$, since $\varepsilon_b \ge 1$. Here, $P(p/q)^b = Q(p/q)^{b+1} \ge (Q(p/q))^{b+1} \ge 1$ for $b \ge 1$ by Jensen's inequality. Thus, the bounds of Lemma 7 hold for every sufficiently large $b$ and every $p$ and $q$ with $h^2(p,q) \le 1$. For $b \uparrow \infty$, we have that $(P(p/q)^b)^{1/b}$ converges to the $L_\infty(P)$-norm of $p/q$, and the bounds tend to the bounds given by the present lemma. □

Given the control of the supremum of likelihood ratios, we can also com- 
pare the Kullback-Leibler divergences of two densities relative to a third 
density. 

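Lemma 9. For any probability densities $p$, $q$ and $r$,

$$P\log\frac{p}{r} \le P\log\frac{p}{q} + 2\Bigl\|\frac{p}{r}\Bigr\|_\infty^{1/2} h(q,r),$$

$$P\Bigl(\log\frac{p}{r}\Bigr)^2 \le 4P\Bigl(\log\frac{p}{q}\Bigr)^2 + 16h^2(p,q) + 16\Bigl\|\frac{p}{r}\Bigr\|_\infty h^2(q,r) + 16h^2(p,r).$$

Here, $p/r$ is read as 0 if $p = 0$ and as $\infty$ if $r = 0 < p$.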

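Proof. To prove the first relation, write $P\log(p/r)$ as the sum of $P\log(p/q)$ and $P\log(q/r)$. Using $\log x \le 2(\sqrt{x} - 1)$ and the Cauchy-Schwarz inequality, we obtain

$$P\log\frac{q}{r} \le 2P\,\frac{1}{\sqrt{r}}(\sqrt{q} - \sqrt{r}) \le 2\Bigl\|\frac{p}{r}\Bigr\|_\infty^{1/2} \int \sqrt{p}\,|\sqrt{q} - \sqrt{r}|\,d\lambda \le 2\Bigl\|\frac{p}{r}\Bigr\|_\infty^{1/2} h(q,r).$$

By the relations $\log_+ x \le 2|\sqrt{x} - 1|$ and $\log_- x = \log_+(1/x) \le 2|\sqrt{1/x} - 1|$, we have, for any probability densities $p$, $q$, $r$,

$$P\Bigl(\log_+\frac{q}{r}\Bigr)^2 \le 4P\Bigl(\sqrt{\frac{q}{r}} - 1\Bigr)^2 \le 4\Bigl\|\frac{p}{r}\Bigr\|_\infty h^2(q,r),$$

$$P\Bigl(\log_-\frac{p}{r}\Bigr)^2 \le 4P\Bigl(\sqrt{\frac{r}{p}} - 1\Bigr)^2 \le 4h^2(p,r).$$

Since $|\log\frac{p}{r}| \le \log_+\frac{p}{q} + \log_-\frac{p}{q} + \log_+\frac{q}{r} + \log_-\frac{p}{r}$, the second relation follows from the triangle inequality for the $L_2(P)$-norm. □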

7. Prior estimates. The following extension of Lemma 6.1 of [8] gives a 
useful probability bound. 



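Lemma 10. For given $N \in \mathbb{N}$, let $(p_1, \ldots, p_N)$ be an arbitrary point in the $N$-dimensional unit simplex and let $(X_1, \ldots, X_N)$ be Dirichlet distributed with parameters $(\alpha_1, \ldots, \alpha_N)$, with $\alpha_j \le 1$ for every $j$ and $\sum_{j=1}^N \alpha_j = m$. Let $a$ and $b$ be positive numbers. Then, for every $0 < \varepsilon \le 1/4$ with $\varepsilon^b \le a\alpha_j$ and $\varepsilon N \le 1$, and constants $c$ and $C$ that depend only on $a, b, m$,

(7.1) $$\Pr\Bigl(\sum_{j=1}^N |X_j - p_j| \le 2\varepsilon,\ \min_{1 \le j \le N} X_j > \frac{\varepsilon^2}{2}\Bigr) \ge C e^{-cN\log \varepsilon^{-1}}.$$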



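Proof. As in the proof of Lemma 6.1 of [8], we can assume without loss of generality that $p_N \ge N^{-1}$, and if $|x_j - p_j| \le \varepsilon^2$ for $j = 1, \ldots, N - 1$, then $x_N > \varepsilon^2$ and $\sum_{j=1}^N |x_j - p_j| \le 2\varepsilon$. Using $\Gamma(\alpha) = \Gamma(1 + \alpha)/\alpha \le 1/\alpha$ for $0 < \alpha \le 1$ and the fact that $\alpha_j \ge \varepsilon^b/a$, we obtain

$$\Pr\Bigl(|X_j - p_j| \le \varepsilon^2,\ X_j > \frac{\varepsilon^2}{2},\ j = 1, \ldots, N - 1\Bigr) \ge \frac{\Gamma(m)}{\prod_{j=1}^N \Gamma(\alpha_j)} \prod_{j=1}^{N-1} \int_{(p_j - \varepsilon^2) \vee (\varepsilon^2/2)}^{(p_j + \varepsilon^2) \wedge 1} x_j^{\alpha_j - 1}\,dx_j$$

$$\ge \Gamma(m)(\varepsilon^2/2)^{N-1}(\varepsilon^b/a)^N$$

$$\ge C\exp(-cN\log \varepsilon^{-1}).$$ □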



8. Tail mass of Dirichlet posterior. To obtain a posterior convergence rate $\varepsilon_n$ for Dirichlet mixtures, we need to show for some sufficiently large $a$ that

(8.1) $$E\,\Pi_n(F : F[-2a, 2a]^c > \varepsilon_n^2 | X_1, \ldots, X_n) \to 0.$$

In [9], (8.1) was derived by showing that the prior mass of the set in the 
display is exponentially small. This forces us to increase a with n sufficiently 
fast, but in the present situation, this method would lead to very restrictive 
tail conditions on the Dirichlet base measure. Instead, when the true distri- 
bution is compactly supported, we shall verify (8.1) for a fixed large a by 
calculations using explicit properties of the Dirichlet prior and posterior. 
Lemma 11. Let the true distribution of $X_1, \ldots, X_n$ be i.i.d. $P_0$. If the model is as described in Section 2 and $\alpha$ has a positive and continuous density on $[-a, a]$, then for any $\varepsilon > 0$ and $0 < b < a\sigma_n^{-1}$, there exists $K$ not depending on $n$ such that

$$E\Bigl[\Pr(F[-2a,2a]^c > \varepsilon | X_1, \ldots, X_n)\,1\Bigl\{\max_{1 \le i \le n} |X_i| \le a\Bigr\}\Bigr] \le E\Pr(\sigma > b\sigma_n | X_1, \ldots, X_n) + \frac{\alpha[-2a,2a]^c}{\varepsilon(\alpha(\mathbb{R}) + n)} + Kn\varepsilon^{-1}e^{-a^2/(4b^2\sigma_n^2)}.$$

Moreover, if $P_0$ is compactly supported and satisfies the assumptions in Section 1.2, $\alpha$ has positive and continuous density on an interval containing the support of $P_0$, $b_n \to \infty$ is a sequence with $b_n\sigma_n \to 0$, $n\varepsilon_n^{-2}e^{-a^2/(4b_n^2\sigma_n^2)} \to 0$ and $\Pr(\sigma > b_n\sigma_n) = o(e^{-n\varepsilon_n^2})$ for a sequence $\varepsilon_n$ such that (4.2) holds, then (8.1) holds.

Proof. To prove the lemma, it is useful to describe the Dirichlet prior and the observations from the Dirichlet mixtures structurally as follows:

• $F \sim D_\alpha$ and $\sigma/\sigma_n \sim G$, independently;

• given $(F, \sigma)$, the variables $\theta_1, \ldots, \theta_n$ are an i.i.d. sample from $F$;

• given $(F, \sigma, \theta_1, \ldots, \theta_n)$, the variables $e_1, \ldots, e_n$ are i.i.d. $N(0, \sigma^2)$;

• the variables $X_1, \ldots, X_n$ are defined as $X_i = \theta_i + e_i$.

Let $G_n(s) = G(s/\sigma_n)$. Given $(\theta_1, \ldots, \theta_n)$, the observations $X_1, \ldots, X_n$ are independent of $F$ and hence the conditional distribution of $F$ given $(X_1, \ldots, X_n, \theta_1, \ldots, \theta_n)$ is independent of $X_1, \ldots, X_n$. This allows us to write

$$\Pr(F[-2a,2a]^c > \varepsilon | X_1, \ldots, X_n) = E\bigl(\Pr(F[-2a,2a]^c > \varepsilon | \theta_1, \ldots, \theta_n) \,\big|\, X_1, \ldots, X_n\bigr).$$

It is well known (cf. [4]) that the conditional law of $F$ given $\theta_1, \ldots, \theta_n$ is the Dirichlet distribution with base measure $\alpha + \sum_{i=1}^n \delta_{\theta_i}$. In particular,

$$F[-2a,2a]^c \,|\, \theta_1, \ldots, \theta_n \sim \mathrm{beta}\bigl(\alpha[-2a,2a]^c + N[-2a,2a]^c,\ \alpha[-2a,2a] + N[-2a,2a]\bigr),$$

where $N(A) = \sum_{i=1}^n 1\{\theta_i \in A\}$. We can use the preceding display and Markov's inequality on the inner expectation on the right-hand side to see that

$$\Pr(F[-2a,2a]^c > \varepsilon | X_1, \ldots, X_n) \le \frac{\alpha[-2a,2a]^c + \sum_{i=1}^n \Pr(\theta_i \in [-2a,2a]^c, \sigma \le b\sigma_n | X_1, \ldots, X_n)}{\varepsilon(\alpha(\mathbb{R}) + n)} + \Pr(\sigma > b\sigma_n | X_1, \ldots, X_n).$$
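The Markov step used here can be illustrated on the beta posterior of $F(A^c)$ given $\theta_1, \ldots, \theta_n$, whose mean is $(\alpha(A^c) + N(A^c))/(\alpha(\mathbb{R}) + n)$. The following sketch compares the resulting bound with a Monte Carlo estimate of the exact beta tail; all numerical values (base-measure masses, sample size, threshold) are made up for the illustration and do not come from the paper.

```python
import random

def markov_bound(alpha_out, alpha_total, n_out, n, eps):
    """Markov bound on Pr(F(A^c) > eps | theta_1..theta_n), from the beta posterior mean."""
    return (alpha_out + n_out) / (eps * (alpha_total + n))

def empirical_tail(alpha_out, alpha_total, n_out, n, eps, draws=20000, seed=0):
    """Monte Carlo estimate of the same probability under the beta posterior."""
    rng = random.Random(seed)
    a = alpha_out + n_out                          # first beta parameter
    b = (alpha_total - alpha_out) + (n - n_out)    # second beta parameter
    hits = sum(rng.betavariate(a, b) > eps for _ in range(draws))
    return hits / draws

# Hypothetical numbers: alpha(A^c) = 0.1, alpha(R) = 1, no theta_i outside A, n = 50.
bound = markov_bound(0.1, 1.0, 0, 50, 0.05)
tail = empirical_tail(0.1, 1.0, 0, 50, 0.05)
print(f"Markov bound {bound:.4f}, simulated tail {tail:.4f}")
```

The bound is crude compared with the simulated tail, but it is linear in $N(A^c)$, which is what allows the sum over $i$ in the display above.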
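Let $\theta_{-n} = (\theta_1, \ldots, \theta_{n-1})$, $H(\theta_1, \ldots, \theta_n)$ be the joint distribution of $(\theta_1, \ldots, \theta_n)$, $H_n(\theta_n|\theta_{-n})$ be the conditional distribution of $\theta_n$ given $\theta_{-n}$ and $H_{-n}(\theta_{-n})$ the marginal distribution of $\theta_{-n}$. Bayes' formula then gives

$$\Pr(\theta_n \in [-2a,2a]^c, \sigma \le b\sigma_n | X_1, \ldots, X_n) = \frac{\int_0^{b\sigma_n}\!\int\!\int_{[-2a,2a]^c} \prod_{i=1}^n \phi_s(X_i - t_i)\,dH_n(t_n|t_{-n})\,dH_{-n}(t_{-n})\,dG_n(s)}{\int_0^\infty\!\int\!\int \prod_{i=1}^n \phi_s(X_i - t_i)\,dH_n(t_n|t_{-n})\,dH_{-n}(t_{-n})\,dG_n(s)}.$$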

The conditional distribution $H_n(\cdot|\theta_{-n})$ of $\theta_n$ given $\theta_{-n}$ is

$\delta_{\theta_i}$, with probability $1/(\alpha(\mathbb{R}) + n - 1)$, $i = 1, \ldots, n - 1$,
$\bar\alpha$, with probability $\alpha(\mathbb{R})/(\alpha(\mathbb{R}) + n - 1)$.
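The conditional distribution just described can be simulated sequentially (the Pólya urn scheme). The sketch below is only an illustration: the function name `polya_urn`, the total mass $\alpha(\mathbb{R}) = 1$ and the standard normal choice for the normalized base measure $\bar\alpha$ are assumptions of this example.

```python
import random

def polya_urn(n, total_mass, base_sampler, seed=0):
    """Draw theta_1..theta_n sequentially: given i earlier draws, the next value
    is fresh from the normalized base measure with probability
    total_mass / (total_mass + i), and otherwise repeats a uniformly chosen earlier draw."""
    rng = random.Random(seed)
    thetas = []
    for i in range(n):
        if rng.random() < total_mass / (total_mass + i):
            thetas.append(base_sampler(rng))      # new atom from the base measure
        else:
            thetas.append(rng.choice(thetas))     # tie with an earlier theta_i
    return thetas

# Hypothetical choices: alpha(R) = 1 and a standard normal base measure.
draws = polya_urn(200, 1.0, lambda rng: rng.gauss(0.0, 1.0))
n_distinct = len(set(draws))
print(len(draws), n_distinct)
```

Ties among the $\theta_i$ are typical under this scheme, so the number of distinct values is usually far smaller than $n$.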

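Thus, with $\delta$ a lower bound for the density of $\alpha$ on $[-a, a]$ (the factors $(2\pi)^{-1/2}$ cancel in the Bayes formula and are omitted),

$$\int s^{-1}e^{-(X_n - t_n)^2/(2s^2)}\,dH_n(t_n|t_{-n}) \ge \frac{\alpha(\mathbb{R})}{\alpha(\mathbb{R}) + n - 1} \int_{X_n - s}^{X_n} s^{-1}e^{-(X_n - t)^2/(2s^2)}\,d\bar\alpha(t) \ge \frac{1}{\alpha(\mathbb{R}) + n - 1}\,s^{-1}e^{-1/2}\,\delta s,$$

provided that $s \le a$. Thus, the integral in the denominator of the Bayes formula is bounded below by

(8.2) $$\frac{\delta e^{-1/2}}{\alpha(\mathbb{R}) + n - 1} \int_0^a\!\int \prod_{i=1}^{n-1} \bigl(s^{-1}e^{-(X_i - t_i)^2/(2s^2)}\bigr)\,dH_{-n}(t_{-n})\,dG_n(s).$$

We now upper-bound the numerator. For $|X_n| \le a$ and $t_n \in [-2a, 2a]^c$, we have that $(X_n - t_n)^2 \ge a^2$, so for any $s \le b\sigma_n$ it follows that

$$s^{-1}e^{-(X_n - t_n)^2/(2s^2)} \le s^{-1}e^{-a^2/(4s^2)}\,e^{-a^2/(4s^2)} \le Ce^{-a^2/(4b^2\sigma_n^2)},$$

where $C = \sup\{s^{-1}e^{-a^2/(4s^2)} : s > 0\}$. This leads to the bound

(8.3) $$Ce^{-a^2/(4b^2\sigma_n^2)} \int_0^{b\sigma_n}\!\int \prod_{i=1}^{n-1} \bigl(s^{-1}e^{-(X_i - t_i)^2/(2s^2)}\bigr)\,dH_{-n}(t_{-n})\,dG_n(s).$$

As $b\sigma_n < a$, the ratio of the integral in (8.3) over that in (8.2) is bounded by 1. Thus, we may bound the expression in the Bayes formula by $Kne^{-a^2/(4b^2\sigma_n^2)}$, for some constant $K$. Putting this into the bound for $\Pr(F[-2a,2a]^c > \varepsilon | X_1, \ldots, X_n)$, we complete the proof of the first assertion.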

For the second assertion, observe that the restriction $X_i \in [-a, a]$ is redundant for sufficiently large $a$. Replace $\varepsilon$ by $\varepsilon_n^2$, where $\varepsilon_n$ satisfies (4.2), and $b$ by a sequence $b_n$ that satisfies the given conditions. It then follows from Fubini's theorem (cf. the proof of Theorem 2.1 of [8]) that $\Pi(\sigma > b_n\sigma_n | X_1, \ldots, X_n) \to 0$ in $P_0^n$-probability. Since $n\varepsilon_n^2 \to \infty$, the result follows. □
9. Proof of Theorem 1. We apply Theorem 5, with $\varepsilon_n$ given by (2.1), and

$$\mathcal{P}_n = \{p_{F,\sigma} : F[-a, a]^c \le \varepsilon_n^2\},$$

$$\mathcal{P}_{n,j} = \{p_{F,\sigma} : F[-a, a]^c \le \varepsilon_n^2,\ 2^j\sigma_n < \sigma \le 2^{j+1}\sigma_n\}, \qquad j = 0, \pm 1, \ldots,$$

where $[-a/2, a/2]$ contains the support of $p_0$.

Let $\bar\varepsilon_n = (n\sigma_n)^{-1/2}\log n \vee \sigma_n^2\log n$, which is smaller than $\varepsilon_n$. In the second part of the proof, it will be seen that (4.2) holds when $\bar\varepsilon_n$ replaces $\varepsilon_n$. If we choose $b_n$ to be a sufficiently large multiple of $(n\bar\varepsilon_n^2)^{1/\gamma}$, then $\Pr(\sigma > b_n\sigma_n) \le e^{-\beta b_n^\gamma} \le e^{-cn\bar\varepsilon_n^2}$ for an arbitrarily large constant $c$ and $b_n\sigma_n \lesssim \sigma_n^{1 - 1/\gamma}(\log n)^{2/\gamma} \vee (n\sigma_n^{4+\gamma}\log^2 n)^{1/\gamma}$, which goes to 0 as a power of $n$ up to a log factor, since $n^{-\alpha_1} \lesssim \sigma_n \lesssim n^{-\alpha_2}$ and $\alpha_2 > (4 + \gamma)^{-1}$. Thus, all conditions of the second part of Lemma 11 hold and hence $\Pi_n(\mathcal{P}_n^c | X_1, \ldots, X_n) \to 0$ in probability.
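To see how the two terms of $\bar\varepsilon_n = (n\sigma_n)^{-1/2}\log n \vee \sigma_n^2\log n$ trade off, the sketch below ignores the logarithmic factors, takes $\sigma_n = n^{-\alpha}$, and searches a grid of $\alpha$ for the smallest exponent of $n$. The grid and the function name are choices made for this illustration only; the actual rate in (2.1) may be larger than $\bar\varepsilon_n$.

```python
def rate_exponent(alpha):
    """Exponent e with max{(n sigma_n)^(-1/2), sigma_n^2} = n^e when sigma_n = n^(-alpha)."""
    variance_term = -(1.0 - alpha) / 2.0   # from (n * n^(-alpha))^(-1/2)
    bias_term = -2.0 * alpha               # from (n^(-alpha))^2
    return max(variance_term, bias_term)

alphas = [k / 1000.0 for k in range(1, 1000)]
best_alpha = min(alphas, key=rate_exponent)
best_exponent = rate_exponent(best_alpha)
print(best_alpha, best_exponent)
```

The two terms balance at $\sigma_n = n^{-1/5}$, giving the exponent $-2/5$ of the classical frequentist rate up to logarithmic factors.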

By Lemma A.3 of [9], the $L_1$-distance between $p_{F,\sigma}$ and $p_{F',\sigma}$ for $F'$ equal to $F$ restricted and renormalized to $[-a, a]$ is bounded above by $2F[-a, a]^c$.
Therefore, for e > e n , we have, with Vn^fo as i n Lemma 3, 



\ogN{Ze 2 ^ 



nji || " 111; 
2 



l^+fe+0(>°4)H^)l 



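for $\varepsilon^2 < 1/2$ by Lemma 3. Hence

$$\log N(\sqrt{3}\,\varepsilon, \mathcal{P}_{n,j}, h) \le \log N(3\varepsilon^2, \mathcal{P}_{n,j}, \|\cdot\|_1) \lesssim \begin{cases} \dfrac{a2^{-j}}{\sigma_n}\Bigl(\log\dfrac{a2^{-j}}{\varepsilon^2\sigma_n}\Bigr)^2, & 2^j\sigma_n < a, \\[2ex] \Bigl(\log\dfrac{1}{\varepsilon^2}\Bigr)^2, & 2^j\sigma_n \ge a. \end{cases}$$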

It follows that $N(\varepsilon_n, \mathcal{P}_{n,j}, h)$ is bounded by a multiple of $\exp(Cj^2 2^{-j}\sigma_n^{-1}\log^2 n)$ if $2^j\sigma_n < a$, and is bounded by a multiple of $\exp(C\log^2 n)$ otherwise, for a large constant $C$. By the assumption on the prior of $\sigma$, we have

$$\Pi(\mathcal{P}_{n,j}) \le \begin{cases} e^{-\beta 2^{-\gamma(j+1)}}, & j < 0, \\ e^{-\beta 2^{\gamma j}}, & j \ge 0. \end{cases}$$

Thus, to verify (4.1) for a multiple of $\varepsilon_n$, it suffices to show that for constants $C, D, E$,

$$\sum_{j < 0} \exp(Cj^2 2^{-j}\sigma_n^{-1}\log^2 n - D2^{-\gamma j}) \le e^{En\varepsilon_n^2},$$

$$\sum_{0 \le j \le \log(a/\sigma_n)} \exp(Cj^2 2^{-j}\sigma_n^{-1}\log^2 n - D2^{\gamma j}) \le e^{En\varepsilon_n^2},$$

$$\sum_{j > \log(a/\sigma_n)} \exp(C\log^2 n - D2^{\gamma j}) \le e^{En\varepsilon_n^2}.$$

For the third sum, this is immediate. In the second sum, we can bound the factors $j^2 2^{-j}$ by a constant and the inequality is then immediate.

The first sum can be transformed into a sum for $j = 0, 1, \ldots$ by the change of variable $j \mapsto -j$. The factor $j^2$ can be absorbed into $2^j$ at the expense of replacing 2 by a slightly bigger number $A = 2^\eta$, where $\eta > 1$ can be arbitrarily close to 1. Put $S(K) = \sum_{j \ge 0} \exp(KA^j - A^{\gamma' j})$, where $\gamma' = \gamma/\eta$. To study the growth rate of $\log S(K)$ as $K \to \infty$, observe that $KA^j - A^{\gamma' j}$ is maximized near $j_0 = (\gamma' - 1)^{-1}\log_A(K/\gamma')$, leading to function value at most a multiple of $K^{\gamma'/(\gamma' - 1)}$, that is, $K^{(\gamma/(\gamma - 1))+}$. Since the series decays faster than geometrically, the sum in the tail is bounded by a multiple of the maximum term. The first $j_0$ terms together contribute at best $j_0$ times the value of the maximum. Thus, it follows that $\log S(K) = O(K^{(\gamma/(\gamma - 1))+}\log K)$ and, clearly, the logarithmic factor can be absorbed into the power. Hence, in view of the fact that $\log \varepsilon_n^{-1} = O(\log n)$, the requirement (4.1) becomes $(a/\sigma_n)^{(\gamma/(\gamma - 1))+}(\log n)^{(2\gamma/(\gamma - 1))+} \lesssim n\varepsilon_n^2$. Again, the logarithmic factor may be absorbed into the power. Thus, the condition is satisfied in view of (2.1).
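The growth of $\log S(K)$ can be examined numerically. The sketch below assumes, for illustration only, $A = 2$, $\gamma' = 2$ and $K = 1024$ (none of these values come from the paper) and evaluates $S(K) = \sum_{j \ge 0} \exp(KA^j - A^{\gamma' j})$ with a log-sum-exp to avoid overflow.

```python
import math

def log_S(K, A=2.0, gamma_prime=2.0, terms=60):
    """Return (log S(K), index of the largest term) for
    S(K) = sum_{j>=0} exp(K A^j - A^(gamma' j)), truncated after `terms` terms."""
    exps = [K * A ** j - A ** (gamma_prime * j) for j in range(terms)]
    m = max(exps)
    # log-sum-exp: terms far below the maximum underflow harmlessly to 0.0
    return m + math.log(sum(math.exp(e - m) for e in exps)), exps.index(m)

K = 1024.0
value, argmax = log_S(K)
j0 = math.log(K / 2.0, 2.0) / (2.0 - 1.0)   # (gamma' - 1)^{-1} log_A(K / gamma')
print(value, argmax, j0, K ** 2 / 4.0)
```

For these values the largest term sits at $j_0 = 9$ and $\log S(K)$ is essentially $K^2/4 = K^{\gamma'/(\gamma' - 1)}/4$, in line with the statement that the sum is dominated by its maximum term.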

Finally, we verify (4.2). Fix numbers $b' > b > 0$, to be chosen sufficiently large at the end of the proof. Because $P_0$ possesses compact support, by Lemma 2, there exists a discrete distribution $F_n = \sum_{j=1}^{N_n} p_j\delta_{z_j}$, supported on $N_n \lesssim \sigma_n^{-1}\log \varepsilon_n^{-b'}$ points in the $\sigma_n$-enlargement of the support of $P_0$, such that

(9.1) $$\|p_{F_n,\sigma_n} - p_{P_0,\sigma_n}\|_1 \lesssim \varepsilon_n^{b'}(\log \varepsilon_n^{-b'})^{1/2} \le \varepsilon_n^b$$

for sufficiently large $n$. This will change by $O(\varepsilon_n^b)$ if we move the support points of $F_n$ by $2\varepsilon_n^b\sigma_n$, so we can assume that the support points are $\varepsilon_n^b\sigma_n$-separated. We can then find disjoint intervals $U_1, \ldots, U_{N_n}$ with $z_j \in U_j$ and $\lambda(U_j) = \varepsilon_n^b\sigma_n$ for $j = 1, \ldots, N_n$. We can modify this to a partition of an interval that contains the support of $P_0$ into $M_n \ge N_n$ intervals $U_1, \ldots, U_{M_n}$, such that each interval $U_j$ has length $\varepsilon_n^b\sigma_n \le \lambda(U_j) \le 2\varepsilon_n^b\sigma_n$ for $j = 1, \ldots, M_n$, and such that $M_n \lesssim \sigma_n^{-1}\log n$.

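Let $\sum_{j=1}^{M_n} |F(U_j) - p_j| \le \varepsilon_n^b$ and $F(U_j) \ge \varepsilon_n^{2b}$ for $j = 1, \ldots, M_n$, and $|\sigma - \bar{b}\sigma_n| \le \varepsilon_n^b\sigma_n$, for $\bar{b} = (b_1 + b_2)/2$. By Lemma 5 [with $U_0 = (\bigcup_{1 \le j \le N_n} U_j)^c$], we have $\|p_{F,\sigma_n} - p_{F_n,\sigma_n}\|_1 \lesssim \varepsilon_n^b$. Furthermore, $\|p_{F,\sigma} - p_{F,\sigma_n}\|_1 \le \|\phi_\sigma - \phi_{\sigma_n}\|_1 \lesssim |\sigma - \sigma_n|/(\sigma \wedge \sigma_n) \lesssim \varepsilon_n^b$. By the triangle inequality and (9.1), we have $\|p_{P_0,\sigma_n} - p_{F,\sigma}\|_1 \lesssim \varepsilon_n^b$ and hence $h(p_{P_0,\sigma_n}, p_{F,\sigma}) \lesssim \varepsilon_n^{b/2}$. Combining the preceding inequality with Lemma 4, we conclude that $h(p_0, p_{F,\sigma}) \lesssim \sigma_n^2 + \varepsilon_n^{b/2} \lesssim \sigma_n^2$ if $b$ is sufficiently large. Now, if $x \in \mathrm{supp}(p_0)$, then

$$p_{F,\sigma}(x) \ge \int_{x - \sigma_n}^{x + \sigma_n} \phi_\sigma(x - z)\,dF(z) \gtrsim e^{-(\sigma_n/\sigma)^2/2}\,\frac{F[x - \sigma_n, x + \sigma_n]}{\sigma} \gtrsim \frac{\varepsilon_n^{2b}}{\sigma_n},$$

because the interval $[x - \sigma_n, x + \sigma_n]$ contains at least one of the intervals $U_1, \ldots, U_{M_n}$. Consequently, $P_0(p_0/p_{F,\sigma}) \lesssim \sigma_n/\varepsilon_n^{2b}$.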

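Results of the two preceding paragraphs imply, for sufficiently large $b$, that

$$\Bigl\{(F, \sigma) : \sum_{j=1}^{M_n} |F(U_j) - p_j| \le \varepsilon_n^b,\ F(U_j) \ge \varepsilon_n^{2b},\ j = 1, \ldots, M_n,\ |\sigma - \bar{b}\sigma_n| \le \varepsilon_n^b\sigma_n\Bigr\} \subset \Bigl\{(F, \sigma) : h(p_0, p_{F,\sigma}) \lesssim \sigma_n^2,\ P_0\Bigl(\frac{p_0}{p_{F,\sigma}}\Bigr) \lesssim \frac{\sigma_n}{\varepsilon_n^{2b}}\Bigr\}.$$

The densities $p_{F,\sigma}$ with $(F, \sigma)$ as in the last set are contained in $B(p_0, c_3\sigma_n^2\log n) \subset B(p_0, c_4\varepsilon_n)$ for a sufficiently large constant $c_4$, in view of Lemma 7.

By construction, $\lambda(U_j) \ge \varepsilon_n^b\sigma_n$ and hence $\alpha(U_j) \gtrsim \varepsilon_n^b\sigma_n$ for every $j = 1, \ldots, M_n$. Furthermore, for sufficiently large $b$, we have $\varepsilon_n^b M_n \le 1$. By Lemma 10 (with $p_j = 0$ for $N_n < j \le M_n$ and a different constant $b$ than in the present proof), we conclude that the prior mass of the set $B(p_0, c_4\varepsilon_n)$ is bounded below by a multiple of $\varepsilon_n^b\exp(-cM_n\log \varepsilon_n^{-1}) \ge \exp(-c'\sigma_n^{-1}(\log n)^2)$, proving the first statement.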

For the proof of the last statement of the theorem, we follow the same steps, but we redefine $\varepsilon_n$ by (2.2). The verification of (4.1) needs no changes, but we adapt the verification of (4.2) as follows. Fix $b' > b > 0$. Because $P_0$ possesses compact support, by Lemma 2, there exists a discrete distribution $F_n = \sum_{j=1}^{N_n} p_j\delta_{z_j}$ supported on $N_n \lesssim \sigma_n^{-1}\log \varepsilon_n^{-b'}$ points in the $\sigma_n$-enlargement of the support of $P_0$ such that (9.1) holds for sufficiently large $n$. The proof of Lemma 2 shows that we can satisfy $F_n(I_j) = P_0(I_j)$ for every interval $I_j$ in a covering of the support of $P_0$ by $M_n \lesssim \sigma_n^{-1}$ intervals of length $\sigma_n$. We can assume that the support points are $\varepsilon_n^b\sigma_n$-separated. We can then find disjoint intervals $U_1, \ldots, U_{N_n}$ with $z_j \in U_j$ and $\lambda(U_j) = \varepsilon_n^b\sigma_n$ for $j = 1, \ldots, N_n$, and such that each $U_j$ is contained in some interval $I_k$.

Suppose that F is a probability measure satisfying J2f=i \ F{Uj) — Pj\ < 
o"^e^ and that a is a number with \a — ba n \ < £ b n cr n for b = (px + &2V2. 
As before, this implies that h(pp 0tO - n ,pF :(T ) < En 2 . Moreover, for every x G 
supp(po), PF,a{x) > (Jn l F[x - (T n , x + a n ] > a" 1 mffij F(Ij), because the in- 
terval [x — a n ,x + er n ] contains at least one of the intervals I±, . . . ,lM n - By 
construction, F n (Ij) = Po(Ij), which is bounded below by a multiple of o^, 
by assumption, for every j. Hence, 

F(Ij)> £ F(Ui)> J2 P t -<e b n = F n {I 3 )-<e b n >a a n . 

i:U x Clj r.UiClj 

Consequently, pfA x ) ~ C 1 if x G supp(p ) and so ||po/Pivl|oo < o-n~ a - 

By Lemma 4, we have that $h(p_0, p_{P_0,\sigma_n}) \lesssim \sigma_n^2$. Therefore, since $p_0/p_{P_0,\sigma_n}$ is bounded by Lemma 6, we conclude by Lemma 8 that $P_0(\log(p_0/p_{P_0,\sigma_n}))^k \lesssim \sigma_n^4$, $k = 1, 2$. With the help of Lemma 9, with $p = p_0$, $q = p_{P_0,\sigma_n}$ and $r =$

p F>cn we see that Po(log(poMv)) fc < a* + e\ + e% 2 a x ~ a < cr£, k = 1, 2, for 
sufficiently large 6. Combining the results of the three preceding paragraphs, 
we see that, for sufficiently large 6, 

| (F, a) : ]T \F(Uj) - Pj \< a%e b n , \a - ba n \ < e b n ajj 

C | (F, a) : P (log * < o$, k = 1, 2} C B(po, c 5 a 2 ). 

By construction, $\lambda(U_j) = \varepsilon_n^b\sigma_n$ and hence $\alpha(U_j) \gtrsim \varepsilon_n^b\sigma_n$ for every $j$. By Lemma 10, we conclude that the prior mass of the set $B(p_0, c_5\sigma_n^2)$ is bounded below by a multiple of $\varepsilon_n^b\exp(-cN_n\log \varepsilon_n^{-b}) \ge \exp(-c'\sigma_n^{-1}(\log n)^2)$, proving (2.2).

The validity of the final remark is clear from the proof, as there are only 
finitely many terms when G is compactly supported. 

10. Proof of Theorem 2. Fix a smooth function $w : \mathbb{R} \to [0, 1]$ with support $[-2, 2]$ that is identically 1 on $[-1, 1]$ and let $w_n(x) = w(x/k_n)$ for $k_n = (\log n/c')^{1/\gamma}$ for some $c' < c$. Define new observations $\bar{X}_1, \ldots, \bar{X}_{\bar{n}}$ from the original observations $X_1, \ldots, X_n$ by rejecting each $X_i$ independently with probability $1 - w_n(X_i)$. Because $P_0[-k_n, k_n]^c = o(n^{-1})$ by the tail assumption on $P_0$, the probability that some $X_i$ is rejected is actually $o(1)$ and hence the posterior distributions based on the new and the original observations are the same with probability tending to one. In particular, they have the same posterior rate of convergence. The new observations are a random sample from the density $p_n$ that is proportional to $p_0w_n$. Because $|\int p_0w_n\,d\lambda - 1| \le P_0[-k_n, k_n]^c = o(n^{-1})$, we have that $h^k(p_n, p_0) = o(n^{-1})$ for every $k$. Hence, it suffices to show that the posterior based on the new observations concentrates at rate $\varepsilon_n$ around $p_n$.
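The thinning construction can be sketched as follows. The particular smooth weight $w$ below (built from $e^{-1/t}$) is one convenient hypothetical choice; it is not checked here against the integrability conditions that the proof imposes on $w$, and the sample, the value of $k_n$ and the function names are likewise assumptions of this illustration.

```python
import math
import random

def smooth_step(t):
    """C-infinity step: equals 0 for t <= 0 and 1 for t >= 1."""
    g = lambda u: math.exp(-1.0 / u) if u > 0 else 0.0
    return g(t) / (g(t) + g(1.0 - t))

def w(x):
    """Smooth weight that is 1 on [-1, 1] and 0 outside [-2, 2] (hypothetical choice)."""
    return smooth_step(2.0 - abs(x))

def thin(sample, k_n, seed=0):
    """Keep each X_i independently with probability w(X_i / k_n)."""
    rng = random.Random(seed)
    return [x for x in sample if rng.random() < w(x / k_n)]

rng = random.Random(1)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]   # stand-in for X_1, ..., X_n
k_n = 3.0                                             # stand-in for (log n / c')^(1/gamma)
kept = thin(sample, k_n)
print(len(sample), len(kept))
```

Observations inside $[-k_n, k_n]$ are always kept and those outside $[-2k_n, 2k_n]$ are always dropped, so the retained sample has a density proportional to $p_0w_n$.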

We shall establish this by means of an obvious triangular array version of Theorem 2.1 of [8] [with the only difference being that we treat $\Pi_n(\mathcal{P}_n^c) \to 0$ in $P_0^n$-probability directly instead of through their condition (2.3)]. We verify this for $\mathcal{P}_n = \{p_{F,\sigma} : F[-2k_n, 2k_n]^c \le \varepsilon_n^2\}$ and $\varepsilon_n = \max\{(n\sigma_n)^{-1/2}(\log n)^{1 + 1/(2\gamma)}, \sigma_n^2\log n\}$. We choose $w$ such that $\int (w'/w)^4 w\,d\lambda < \infty$ and $\int (w''/w)^2 w\,d\lambda < \infty$. Then $\int (p_n'/p_n)^4 p_n\,d\lambda = O(1)$ and $\int (p_n''/p_n)^2 p_n\,d\lambda = O(1)$, and hence $h^2(p_n, p_n * \phi_{\sigma_n}) = O(\sigma_n^4)$, by Lemma 4.

The verification of the entropy bound can proceed as before, except that 
we obtain an additional logarithmic factor by the dependence of a n on n, as 
follows: 

logN(SelP n , || • ||i) < - flog < -(logn) 2 + 1 /(27) < n£ 2. 
For the verification that $\Pi_n(\mathcal{P}_n^c | X_1, \ldots, X_n) \to 0$, we use Lemma 11, where it suffices that

(10.1) e -4/(&£6g)i 1 ,o. 

e n a n iahx\ t \< kn a'{t) 

This is certainly the case under the tail condition on a'. 

We adapt the verification of prior concentration rate in Kullback-Leibler neighborhoods as follows. Fix $b' > b > 0$. Because $P_n$ possesses support $[-2k_n, 2k_n]$, by Lemma 2, there exists a discrete distribution $F_n = \sum_{j=1}^{N_n} p_j\delta_{z_j}$ supported on $N_n \lesssim k_n\sigma_n^{-1}\log \varepsilon_n^{-b'}$ points in the interval $[-2k_n - \sigma_n, 2k_n + \sigma_n]$ such that

(10.2) $$\|p_{F_n,\sigma_n} - p_{P_n,\sigma_n}\|_1 \lesssim \varepsilon_n^{b'}(\log \varepsilon_n^{-b'})^{1/2} \le \varepsilon_n^b,$$

for sufficiently large $n$. Because this distance changes by $O(\varepsilon_n^b)$ if we move the support points of $F_n$ by $2\varepsilon_n^b\sigma_n$, we can assume that the support points are $\varepsilon_n^b\sigma_n$-separated. We can then find disjoint intervals $U_1, \ldots, U_{N_n}$ with $z_j \in U_j$ and $\lambda(U_j) = \varepsilon_n^b\sigma_n$ for $j = 1, \ldots, N_n$. We can modify this to a partition of the interval $[-2k_n, 2k_n]$ into $M_n \ge N_n$ intervals $U_1, \ldots, U_{M_n}$ such that each interval $U_j$ has length contained in $[\varepsilon_n^b\sigma_n, \sigma_n]$ and such that $M_n \lesssim k_n\sigma_n^{-1}\log \varepsilon_n^{-b}$.

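Let $F$ satisfy $\sum_{j=1}^{M_n} |F(U_j) - p_j| \le \varepsilon_n^b$ and $F(U_j) \ge \varepsilon_n^{2b}$ for $j = 1, \ldots, M_n$, and suppose that $|\sigma - \bar{b}\sigma_n| \le \varepsilon_n^b\sigma_n$ for $\bar{b} = (b_1 + b_2)/2$. By Lemma 5, we have $\|p_{F,\sigma_n} - p_{F_n,\sigma_n}\|_1 \lesssim \varepsilon_n^b$ and $\|p_{F,\sigma} - p_{F,\sigma_n}\|_1 \lesssim \varepsilon_n^b$. Applying the triangle inequality repeatedly and combining the preceding two inequalities with (10.2), we find that $\|p_{P_n,\sigma_n} - p_{F,\sigma}\|_1 \lesssim \varepsilon_n^b$ and so $h(p_{P_n,\sigma_n}, p_{F,\sigma}) \lesssim \varepsilon_n^{b/2}$. Combining the preceding inequality with Lemma 4, we conclude that $h(p_n, p_{F,\sigma}) \lesssim \sigma_n^2 + \varepsilon_n^{b/2} \lesssim \sigma_n^2$ if $b$ is sufficiently large.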

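For every $x$ in the interval $[-2k_n, 2k_n]$, we have

$$p_{F,\sigma}(x) \ge \int_{x - \sigma_n}^{x + \sigma_n} \phi_\sigma(x - z)\,dF(z) \gtrsim \frac{F[x - \sigma_n, x + \sigma_n]}{\sigma_n} \ge \frac{\varepsilon_n^{2b}}{\sigma_n},$$

because the interval $[x - \sigma_n, x + \sigma_n]$ contains at least one of the intervals $U_1, \ldots, U_{M_n}$. Consequently, $P_n(p_n/p_{F,\sigma}) \lesssim \sigma_n/\varepsilon_n^{2b}$.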

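Combining the above results, we see that, for sufficiently large $b$,

$$\Bigl\{(F, \sigma) : \sum_{j=1}^{M_n} |F(U_j) - p_j| \le \varepsilon_n^b,\ F(U_j) \ge \varepsilon_n^{2b},\ j = 1, \ldots, M_n,\ |\sigma - \bar{b}\sigma_n| \le \varepsilon_n^b\sigma_n\Bigr\} \subset \Bigl\{(F, \sigma) : h(p_n, p_{F,\sigma}) \lesssim \sigma_n^2,\ P_n\Bigl(\frac{p_n}{p_{F,\sigma}}\Bigr) \lesssim \frac{\sigma_n}{\varepsilon_n^{2b}}\Bigr\}.$$

The densities $p_{F,\sigma}$, with $(F, \sigma)$ as in the last set, are contained in $B(p_n, c_5\sigma_n^2\log n)$ for a sufficiently large constant $c_5$ in view of Lemma 7.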
By construction, $\lambda(U_j) \ge \varepsilon_n^b\sigma_n$ and every $U_j$ is contained in the interval $[-2k_n, 2k_n]$. By the lower bound assumption on $\alpha'$, we see that $\alpha(U_j) \ge \min_{|t| \le k_n} \alpha'(t)\,\varepsilon_n^b\sigma_n \gtrsim n^{-e}$ for every $j = 1, \ldots, M_n$ and some $e > 0$. Furthermore, we have $\varepsilon_n^b M_n \le 1$ if we choose $b$ sufficiently large. By Lemma 10, we conclude that the prior mass of the set $B(p_n, c_5\varepsilon_n)$ is bounded below by a multiple of $\exp(-cM_n\log \varepsilon_n^{-1}) \ge \exp(-c'\sigma_n^{-1}(\log n)^{2 + 1/\gamma})$.

11. Proofs of Theorems 3 and 4. Let $\mathcal{P}_n$ be $\mathcal{P}_{a_n,\sigma_n}$ in the notation of Theorem 6. Hence, its bracketing entropy integral is bounded by

$$\int_0^{\varepsilon_n} \sqrt{\frac{a_n}{\sigma_n}\Bigl(\log\frac{a_n}{\varepsilon\sigma_n}\Bigr)^2}\,d\varepsilon \lesssim \sqrt{\frac{a_n}{\sigma_n}}\,\varepsilon_n\log\frac{a_n}{\sigma_n\varepsilon_n}.$$

For $\varepsilon_n = (a_n/(n\sigma_n))^{1/2}\log n$, this is bounded above by a multiple of $\sqrt{n}\,\varepsilon_n^2$. Because $p_0$ has compact support, $\mathcal{P}_n$ contains $p_{P_0,\sigma_n} = p_0 * \phi_{\sigma_n}$, at least if $b_1 < 1 < b_2$, as we shall assume for simplicity. By Lemma 6, the quotient $p_0/p_{P_0,\sigma_n}$ is bounded above uniformly in $n$. The distance $h(p_0, p_{P_0,\sigma_n})$ is of the order $O(\sigma_n^2)$, by assumption. Hence, Theorem 3 follows by an application of Theorem 4 in [16], or Theorem 3.4.4 in [14].
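The claim that the bracketing integral is of order $\sqrt{n}\,\varepsilon_n^2$ can be checked with the closed form $\int_0^E \log(c/\varepsilon)\,d\varepsilon = E(\log(c/E) + 1)$. The sketch below fixes, purely for illustration, $a_n = 1$ and $\sigma_n = n^{-1/5}$; these choices and the function names are not taken from the paper.

```python
import math

def bracketing_integral(a, sigma, eps_n):
    """Closed form of int_0^eps_n sqrt(a/sigma) * log(a / (eps * sigma)) d eps."""
    return math.sqrt(a / sigma) * eps_n * (math.log(a / (eps_n * sigma)) + 1.0)

def ratio(n, a=1.0, alpha=0.2):
    """Bracketing integral divided by sqrt(n) * eps_n^2, for sigma_n = n^(-alpha)."""
    sigma = n ** (-alpha)
    eps_n = math.sqrt(a / (n * sigma)) * math.log(n)
    return bracketing_integral(a, sigma, eps_n) / (math.sqrt(n) * eps_n ** 2)

ratios = [ratio(10.0 ** k) for k in range(3, 10)]
print([round(r, 3) for r in ratios])
```

Under these assumptions the ratio stays below 1 over six orders of magnitude of $n$, which matches the statement that the integral is bounded by a multiple of $\sqrt{n}\,\varepsilon_n^2$.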

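The sieve $\mathcal{P}_n$ given by (3.2) is equal to the set $\mathcal{P}_{A,\sigma_n}$ considered in Theorem 7. For the given function $A$, the function $\bar{A}$ in this theorem can be taken to be equal to a multiple of $1 - \Phi(ra)$ if $\delta < 1/2$ for some $0 < r < 1$, and equal to $A(ra)$ for some $r < 1$ if $\delta \ge 1/2$. Therefore, $\bar{A}^{-1}(\varepsilon) \lesssim (\log \varepsilon^{-1})^{(1 \vee 2\delta)/2}$ and, in view of Theorem 7, the bracketing integral of $\mathcal{P}_n$ is bounded by

$$\int_0^{\varepsilon_n} \sqrt{\frac{1}{\sigma_n}\Bigl(\log\frac{1}{\varepsilon\sigma_n}\Bigr)^{2 + (1 \vee 2\delta)/2}}\,d\varepsilon \lesssim \frac{\varepsilon_n}{\sqrt{\sigma_n}}\Bigl(\log\frac{1}{\varepsilon_n\sigma_n}\Bigr)^{1 + (1 \vee 2\delta)/4},$$

which is $O(\sqrt{n}\,\varepsilon_n^2)$ for $\varepsilon_n = (n\sigma_n)^{-1/2}(\log n)^{1 + (1 \vee 2\delta)/4}$. The remainder of the proof can be completed as before.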